1 Project Summary


This research is an investigation of the application of coincident pulse techniques
to multiprocessor interconnection networks. The research focuses on
three main areas: an examination of the applicability of coincident pulse
techniques and required hardware to multiprocessor applications, an investigation
of the limits of scalability, and an exploration of various interconnection
structures which can be created using these techniques.


2  Project  Objectives 


Specific  objectives  of  this  research  are: 

• The determination of the limits placed by current technology on implementations of coincident
pulse structures. These limits include pulse width, detection limits, degree of overlap for
coincidence, power distribution and pulse synchronization.

• The study of specific structures which are capable of supporting simulcasting and multicasting
communications and the comparison of these structures with functionally comparable
electronic systems.

•  The  resolution  of  specific  configuration  issues  related  to  clock  distribution  mechanisms  for 
latching  data  in  the  simulcasting  structure. 

•  The  characterization  of  the  tradeoff  between  complexity  and  latency  for  choosing  the  number 
of  dimensions  appropriate  for  a  particular  coincident  structure. 

•  The  study  of  techniques  for  error  detection  and  recovery  in  two  dimensional  structures. 


3  Project  Status 

Several  of  the  specific  objectives  listed  above  were  met  in  the  first  year  of  our  project: 

• An investigation of the limits placed by current technology on the coincident pulse technique.

• Power and distribution issues for specific bus configurations.

• Investigations of specific configurations for multicasting and simulcasting.

As noted in the publications listed below, and reproduced in the appendix, the second year has
seen progress on the following objectives:

•  The  identification  of  the  limits  to  scalability  for  these  systems. 
• The quantification and resolution of the “shadow problem” in linear and multi-dimensional
structures.

•  The  generalization  of  the  coincident  structures  to  pipelined  bus  structures  and  the  analysis 
of  inherent  advantages  of  pipelined  communication  structures  for  both  optical  and  electronic 
interconnections. 


Several  of  the  specific  objectives  have  been  modified  based  on  our  research.  In  particular: 


• We see a definite need for active amplification in the tapped bus structures. Therefore we
are investigating the use of non-linear (Erbium-doped) fiber to create a “lossless” tapped bus
structure.

•  The  application  of  the  signal  pipelining  results  to  reconfigurable  time  division  multiplexed 
structures. 


Our  next  set  of  specific  goals  are: 


• The generalization of our work in TDM structures to more general reconfigurable optical
interconnection networks.

• The identification of bandwidth as a virtual resource which can be allocated dynamically on
a variety of networks.

• The use of locality in source-destination address pairs to provide a dynamic reconfiguration
mechanism which provides channels at optical message speed, while optimizing resources at
computer algorithm speeds.


The  next  sections  summarize  some  of  our  recent  contributions,  which  have  not  yet  appeared  in 
the  open  literature.  Other  contributions  are  reported  in  the  papers  given  in  the  Appendix  of  this 
report. 
4. INTERCONNECTION NETWORKS REVISITED


Interconnection networks provide physical connections for communications in multiprocessor systems.
Often, for technological and economical reasons, an interconnection network has limited connectivity,
and connection paradigms that enhance connectivity need to be developed. The complexities of
hardware and control of the network and the communication efficiency depend on both the topology of
the network and the connection paradigms.

In  multiprocessor  systems,  a  processor  may  communicate  with  other  processors  from  time  to  time, 
but  not  all  of  the  time.  In  other  words,  an  application  program  may  need  a  set  of  connections,  but  not  all 
of  them  will  be  used  at  the  same  time.  Therefore,  it  may  be  neither  feasible,  nor  efficient  to  establish  all 
connections  at  all  times.  Instead,  establishments  of  required  connections  may  be  interleaved  such  that 
each  subset  of  connections  is  alternately  established  for  a  fixed  period  of  time  called  a  time  slot.  That  is, 
the  available  bandwidth  of  the  interconnection  network  may  be  shared  among  these  connections  in  a  time 
division multiplexed way. We call this connection paradigm Reconfiguration with Time Division
Multiplexing (RTDM).

4.1.  RTDM  IN  MULTISTAGE  INTERCONNECTION  NETWORKS  (MINs) 

Let the set of the input ports and the set of the output ports of an $N \times N$ MIN be $I$ and $O$ respectively,
where $I = O = \{0, 1, \ldots, N-1\}$. A path in the MIN between $i \in I$ and $j \in O$ is denoted by
$p_{i \to j} = (i, j) \in I \times O$. Define a mapping, $M$, to be a set of paths that can be established at the same time without
conflicts in the MIN. More specifically,

$$M = \{\, p_{i \to j} \mid \text{all } p_{i \to j} \text{ can be established at the same time without conflicts, where } 0 \le i, j \le N-1 \,\}$$

Note  that,  an  admissible  (or  permissible)  permutation  is  a  mapping  that  contains  N  paths.  We  will  refer  to 
mappings  that  contain  less  than  N  paths  as  partial  mappings.  Since  establishment  of  two  paths  at  the 
same  time  may  cause  conflicts,  not  every  set  of  paths  is  a  mapping.  We  refer  to  the  establishment  of  all 
the  paths  in  a  mapping  as  the  realization  of  the  mapping. 

Given a mapping, there is a way to set switches in the MIN to realize the mapping. Let $m = N/2$ be
the number of switches per stage, and let $n = \log N$ be the number of stages in the MIN. Define a switch
setting array to be an array whose $i$-th element at the $j$-th column corresponds to the $i$-th switch at the
$j$-th stage in the MIN. Denote the switch setting array of a mapping $M$ by $SS(M)$ and its elements by
$SS(M)\langle i, j \rangle$ where $1 \le i \le m$ and $1 \le j \le n$. The value of an element in $SS(M)$ is "0" or "1" if the
corresponding switch has to be set to "straight" or "cross" to realize mapping $M$. The value of an element
is "x" if the corresponding switch can be in any state without affecting the realization of the mapping, in
which case the mapping must be a partial mapping. Two mappings $M_1$ and $M_2$ are said to be not
compatible with each other if there are some $i$ and $j$ such that the two elements $SS(M_1)\langle i, j \rangle$ and
$SS(M_2)\langle i, j \rangle$ are either "0" or "1" but not equal. That is, $M_1$ and $M_2$ are not compatible if the
realization of both mappings at the same time will cause conflicts in switch setting. Otherwise, $M_1$ and
$M_2$ are said to be compatible with each other, in which case the two mappings can be merged into one
mapping, namely $M = M_1 \cup M_2$.
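The compatibility test defined above is an element-wise comparison of two switch setting arrays. A minimal Python sketch follows; the function names `compatible` and `merge`, the list-of-rows representation, and the three example arrays are illustrative assumptions and are not taken from the report.

```python
# Switch setting arrays are lists of rows; each entry is "0" (straight),
# "1" (cross) or "x" (don't care), following the definitions above.

def compatible(ss_a, ss_b):
    """True if no switch is required to be "0" by one array and "1" by the other."""
    return all(a == b or "x" in (a, b)
               for row_a, row_b in zip(ss_a, ss_b)
               for a, b in zip(row_a, row_b))


def merge(ss_a, ss_b):
    """Switch setting array of the merged mapping M = M1 U M2."""
    if not compatible(ss_a, ss_b):
        raise ValueError("mappings are not compatible")
    return [[b if a == "x" else a for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(ss_a, ss_b)]


# Arbitrary illustrative arrays for a 4x4 MIN (m = 2 switches per stage,
# n = 2 stages); they are hypothetical values, not mappings from the report.
ss1 = [["0", "x"], ["x", "1"]]
ss2 = [["0", "1"], ["x", "x"]]
ss3 = [["1", "x"], ["x", "x"]]

print(compatible(ss1, ss2))  # True
print(merge(ss1, ss2))       # [['0', '1'], ['x', '1']]
print(compatible(ss1, ss3))  # False
```

Treating "x" as a wildcard keeps the check linear in the number of switches, which is the property the later algorithms rely on.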


Figure 1. An 8x8 generalized cube network (input ports, stages 1 to 3, output ports; switch states: straight, cross)


MINs under consideration in this report are generalized cube networks, which are topologically
equivalent to many blocking MINs. An $N \times N$ generalized cube network has $n$ (where $N = 2^n$) stages of
2x2 switches with cube-type connections. Figure 1 shows an example of such a MIN with $N = 8$ in
which stages are numbered 1 to $n = 3$ from left to right. Each switch is assumed to have two states:
straight or cross, as shown in the figure. Three switch control strategies are possible for this type of
MIN. Individual switch control assumes one control signal per switch. Individual stage control assumes
one control signal per stage and partial stage control assumes $i + 1$ control signals in stage $i$. In general,
individual switch control is used since it yields more powerful connectivities in a MIN.

Given a set of paths $P \subseteq I \times O$, it may not be possible to establish all paths in $P$ at the same time
without conflicts. However, $P$ can be partitioned into several mappings, $P = M_1 \cup M_2 \cup \cdots \cup M_t$.
Each mapping $M_i$, $i = 1, 2, \ldots, t$, may be realized for a fixed length of time, which we call a time slot. By
doing so, every path in $P$ is established in one of the $t$ time slots and $P$ is said to be realized through
time-division multiplexing. Note that the switch setting arrays for different mappings are usually different.
Therefore, the MIN has to change its switch setting after each time slot.
We call a MIN a Time-Division Multiplexed MIN (TDM-MIN) if it repeatedly realizes a sequence of
mappings in a time-division multiplexed way. More specifically, a $t$-way TDM-MIN changes its switch
setting after each time slot to realize one of $t$ mappings $M_1, M_2, \ldots, M_t$ in a round-robin fashion. Without
loss of generality, we assume that $M_i$ is realized during the $i$-th time slot ($1 \le i \le t$). We call the ordered
sequence $[M_1, M_2, \ldots, M_t]$ a configuration of the $t$-way TDM-MIN and the number of mappings in the
sequence, $t$, the Multiplexing Cycle Length (MCL) of the configuration. Note that this definition of
configuration of a MIN is different from the conventional one that defines a configuration of a MIN as a
set of numberings of input and output ports within a given MIN topology.

Once a configuration of a TDM-MIN has been determined, each mapping and its corresponding
switch setting array in each time slot are determined. That is, in each time slot, the output port to which
any given input port is connected is known. So is the state to which any given switch should be set. A
global clock may be used to synchronize all input ports and switches in the MIN at the beginning of each
time slot. Each input port maintains a list of output ports to which it is connected during different time
slots of a multiplexing cycle. More specifically, the $k$-th entry in the list of source node $i$ is $j$ if
$p_{i \to j} \in M_k$. Each switch is assumed to have a shift register whose size is no less than the multiplexing cycle
length, $t$, of a configuration. The sequence of $t$ states that a switch should be set to is stored in the shift
register. The $k$-th bit of the shift register of the $i$-th switch at stage $j$ may be either "0" or "1" if the
corresponding element of $SS(M_k)$ is "x". Otherwise, it should be equal to the value of its corresponding element of
$SS(M_k)$.

At  the  beginning  of  each  time  slot,  every  switch  is  set  to  the  state  specified  by  the  content  of  its  shift 
register.  After  switches  are  set  properly,  an  input  port  can  transmit  a  message  to  the  output  port  to  which 
it  is  connected  in  this  time  slot.  Note  that,  if  individual  stage  control  is  used,  only  one  shift  register  per 
stage  is  required. 
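The per-port output lists described above can be sketched as a small lookup table. In the Python sketch below, `port_schedules` and `destination` are hypothetical helper names, the two mappings on a 4x4 MIN are illustrative values, and time slots are counted from 0 rather than from 1 as in the text.

```python
def port_schedules(config, n_ports):
    """Build, for every input port i, the list whose k-th entry is the output
    port j with p(i -> j) in the k-th mapping, or None if port i is idle."""
    lists = [[None] * len(config) for _ in range(n_ports)]
    for k, mapping in enumerate(config):
        for i, j in mapping:
            lists[i][k] = j
    return lists


def destination(lists, i, slot_number):
    """Output port reached by input port i in a given slot (round-robin)."""
    t = len(lists[i])                 # multiplexing cycle length
    return lists[i][slot_number % t]


# Two illustrative mappings on a 4x4 MIN (assumed values, not from the report).
config = [
    {(0, 1), (1, 0), (2, 3), (3, 2)},
    {(0, 2), (1, 3), (2, 0), (3, 1)},
]
lists = port_schedules(config, 4)
print(lists[0])                   # [1, 2]
print(destination(lists, 3, 5))   # 1
```

A sender that wants to reach output port `j` simply waits for a slot whose entry equals `j`, which is why messages need not carry destination addresses.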

In MIMD environments, a MIN is commonly either circuit-switched or packet-switched or a combination
of both. In circuit-switching, only a limited number of circuits can be established without
conflicts.  Dynamically  setting  up  or  releasing  a  circuit  involves  run  time  overheads.  In  packet  switching, 
having  one  routing  tag  for  each  packet  also  introduces  run  time  overheads.  In  addition,  switches  are  more 
complex  since  they  need  to  do  buffering  and  arbitrations  based  on  routing  tags  of  incoming  packets. 

In a TDM-MIN with multiplexing cycle length $t$, up to $tN$ different connections can be established.
If a set of connections required by an application is known, a MIN can be set to a TDM configuration statically.
This means that after execution begins, the time slot in which each connection is established is
predetermined and routing decisions are as simple as waiting for the appropriate time slots. As a result,
messages do not have to contain routing information such as destination addresses, nor do they need to be
buffered at any intermediate switch. Overheads due to arbitration and path conflicts in circuit or packet
switching are eliminated at run time.
Even in applications that require dynamic changes of connections, the overall communication
pattern is expected to change slowly during execution. Dynamic reconfiguration may affect one or more
mappings but most of the other mappings will remain intact. As a result, overheads due to dynamic
establishments or releases of connections are lower than with circuit switching. In essence, the
RTDM connection paradigm takes advantage of the relative stability of communication patterns to simplify
control and to reduce overheads. On the other hand, however, the multiplexing degree in a TDM-MIN
affects the latency of a connection, which must be kept low to achieve high communication
efficiency.

4.2. STATIC RECONFIGURATIONS
4.2.1. Connection Request Graphs

Communication requirements of an application can often be obtained as a result of compile time
analysis. After data allocation and processor assignment are done, memory access patterns or
inter-processor communication patterns can be represented by a bipartite graph, which we call a Connection
Request (or CR) graph.


Figure 2. Examples of Connection Request (CR) Graphs

Figure 2 shows two examples of CR graphs, one for shared memory systems and another for message
passing systems. Figure 2(a) shows a CR graph based on processor to memory connections. A
directed edge from processor $i$ at the top to memory module $j$ at the bottom means that processor $i$ may
address memory module $j$. On the other hand, Figure 2(b) shows a CR graph based on inter-processor
connections. A directed edge from a source processor $i$ at the top to a destination processor $j$ at the bottom
means that processor $i$ may send messages to processor $j$. An edge from processor $i$ to itself is
meaningless in CR graphs for inter-processor connections. Note that, due to the dynamic nature of
memory access requests in many applications, it is relatively difficult to construct CR graphs for shared
memory systems.
A node in a CR graph is called either a source node or a destination node. Let source node $i$,
$0 \le i \le N-1$, use input port $i$ of an $N \times N$ MIN and let destination node $j$, $0 \le j \le N-1$, use output port $j$
of the same MIN. Therefore, an edge from source node $i$ to destination node $j$ in a CR graph requires the
establishment of a path $p_{i \to j}$ in the MIN. We will use the same notation for a path to denote an edge and
use the terms "edge" and "path" interchangeably.

Denote the set of all edges in the CR graph by $E$, and the number of edges in the set by $|E|$. As an
example, the set $E$ in the CR graph in Figure 2(b), with $|E| = 12$, is given in Eq. 3.1 below.

$$E = \{(0,1), (1,0), (1,3), (2,1), (2,3), (3,2), (4,5), (5,4), (5,6), (6,7), (7,5), (7,6)\} \qquad (3.1)$$

Note that any directed or undirected communication graph can be converted into a corresponding bipartite
CR graph. The number of edges going out from a node is called the out-degree of the node and the
number of edges coming into a node is called the in-degree of the node. We will refer to the maximum of
the out-degree and the in-degree of a node as the degree of a node.

Given a CR graph, we call a configuration $[M_1, M_2, \ldots, M_t]$ a Minimal Connection (or MC)
configuration for the CR graph if it satisfies the following two conditions.

(1). $E \subseteq \bigcup_{i=1}^{t} M_i$

(2). for any $i, j \in \{1, 2, \ldots, t\}$ with $i \ne j$, $M_i$ and $M_j$ are not compatible.

The first condition states that any edge $(i, j) \in E$ is established in a mapping $M_k$ and the second condition
states that no two mappings in the configuration can be merged together. We call a
configuration for a CR graph optimal if it is an MC configuration for the graph and it has the least
multiplexing cycle length among all MC configurations for the same graph. Note that, if $t$ is equal to the
maximum degree of the nodes in a CR graph, then the MC configuration is optimal.

4.2.2. Embeddings of Regular Communication Structures

When the communication structure of an application is regular, finding a configuration for its CR
graph is often called embedding. Note that the ability to embed regular communication structures
efficiently is important since there are many existing applications designed for them. The multiplexing
cycle length $t$ of a configuration is a measure of the efficiency of the embedding. This measure is, in
some sense, similar to the dilation cost in conventional embeddings. We also define path utilization (PU)
to be the ratio of the number of connections required versus the number of connections that can be
established in one multiplexing cycle. That is, $PU = \frac{|E|}{tN}$.
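As a worked example of these two measures, the Python sketch below computes the maximum node degree (a lower bound on the multiplexing cycle length) and the path utilization, taking PU as the number of requested connections divided by $tN$. The helper names are hypothetical, and the edge list is the one read from Eq. 3.1, with its last edge assumed to be (7,6) so that it agrees with the mappings listed in Eq. 3.7a.

```python
from collections import Counter


def max_degree(edges):
    """Largest in-degree or out-degree over all nodes of a CR graph."""
    out_deg = Counter(i for i, _ in edges)
    in_deg = Counter(j for _, j in edges)
    return max(max(out_deg.values()), max(in_deg.values()))


def path_utilization(edges, t, n_ports):
    """PU = (connections required) / (connections available per cycle)."""
    return len(edges) / (t * n_ports)


# Edge set of Figure 2(b) as read from Eq. 3.1 (last edge assumed to be (7, 6)).
E = [(0, 1), (1, 0), (1, 3), (2, 1), (2, 3), (3, 2),
     (4, 5), (5, 4), (5, 6), (6, 7), (7, 5), (7, 6)]

print(max_degree(E))              # 2
print(path_utilization(E, 3, 8))  # 0.5
print(path_utilization(E, 2, 8))  # 0.75
```

Under these assumptions a cycle length of 2 would meet the degree bound and raise the utilization from 0.5 to 0.75.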

Since there are $N^2$ paths between every input port and every output port in a completely connected
CR graph and at most $N$ paths can be established in each mapping, an MC configuration that embeds a
completely connected network has at least $N$ different mappings. We call a configuration
$[M_1, M_2, \ldots, M_N]$ such that

$$\bigcup_{i=1}^{N} M_i = I \times O \qquad (3.2)$$

a completely connected (CC) configuration since every path in a completely connected network is
established in one of the $N$ mappings of the configuration. Such an embedding is clearly an optimal one with
its multiplexing cycle length $t = N$ and path utilization $PU = 1$. There is more than one CC
configuration. As one example, define

$$M_s(k) = \{\, p_{i \to j} \mid j = (i + k) \bmod N \ \text{for}\ i = 0, 1, \ldots, N-1 \,\} \qquad (3.3)$$

and call it a shift-k mapping. Therefore, the configuration $[M_s(0), M_s(1), \ldots, M_s(N-1)]$ establishes paths
from any input port to all $N$ output ports and, thus, is a CC configuration. As another example, define

$$M_f(k) = \{\, p_{i \to j} \mid j = i \ \mathrm{xor}\ k \ \text{for}\ i = 0, 1, \ldots, N-1 \,\} \qquad (3.4)$$

and call it a flip-k mapping where xor is the bit-wise Exclusive-OR operation. The flip-k mapping can
be realized by individual stage control. The configuration $[M_f(0), M_f(1), \ldots, M_f(N-1)]$ is also a
CC configuration. Note that CC configurations are functionally equivalent in terms of their multiplexing
cycle lengths and path utilizations. However, the CC configuration with flip-k mappings may be
chosen for the purpose of embedding a completely connected network in the time domain due to its control
simplicity. It is also worth noting that in the case of processor-to-processor interconnection, the identity
mappings, such as $M_s(0)$ or $M_f(0)$, that establish no paths other than those from a node to itself can be
deleted from the CC configurations.
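The two families of mappings can be generated directly from their definitions. The Python sketch below builds shift-k and flip-k mappings as sets of (input, output) pairs and checks that each family of $N$ mappings covers all $N^2$ paths; the function names are illustrative, and the indexing of the configurations from $k = 0$ to $N-1$ is an assumption.

```python
def shift_mapping(k, n_ports):
    """Shift-k mapping: input i is connected to output (i + k) mod N."""
    return {(i, (i + k) % n_ports) for i in range(n_ports)}


def flip_mapping(k, n_ports):
    """Flip-k mapping: input i is connected to output i xor k."""
    return {(i, i ^ k) for i in range(n_ports)}


def is_cc_configuration(config, n_ports):
    """True if the N mappings together establish every one of the N*N paths."""
    all_paths = {(i, j) for i in range(n_ports) for j in range(n_ports)}
    return len(config) == n_ports and set().union(*config) == all_paths


N = 8  # must be a power of two for the flip-k family
shift_config = [shift_mapping(k, N) for k in range(N)]
flip_config = [flip_mapping(k, N) for k in range(N)]

print(is_cc_configuration(shift_config, N))  # True
print(is_cc_configuration(flip_config, N))   # True
print(sorted(flip_mapping(3, N))[:4])        # [(0, 3), (1, 2), (2, 1), (3, 0)]
```

Because every path lies in exactly one mapping of a CC configuration, the mapping that carries a given path can be found by arithmetic alone (`k = i ^ j` for the flip-k family).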

Since any CR graph is a subgraph of a completely connected graph, it can be embedded in a CC
configuration of a TDM-MIN. This, however, requires an $N$-way TDM-MIN and thus may be inefficient
in terms of both the multiplexing cycle length and the path utilization. An alternative is to find an MC
configuration of length $t < N$ which embeds the CR graph. In other words, a $t$-way TDM-MIN instead
of an $N$-way TDM-MIN, for some $t < N$, can be used to increase the embedding efficiency. The smaller
the ratio $t/N$ is, the more efficient it is to use such an MC configuration. Table 1 summarizes embedding
results of several regular communication structures.

4.2.3. Static Reconfigurations Based On Arbitrary CR Graphs

For non-regular CR graphs, an MC configuration can always be obtained by selecting a subset of
mappings from a CC configuration. Such an MC configuration can often improve performance of applications
by reducing the multiplexing cycle length to $t$ time slots, for some $t < N$. Note that, given a
specific path, there is only one mapping in a CC configuration that establishes that path. The mapping can
usually be determined by either a simple arithmetic operation or a table look-up. For example, if the CC
configuration consists of flip-k mappings, the mapping that will establish the path $p_{i \to j}$ is $M_f(k)$ where
$k = i \ \mathrm{xor}\ j$.
Structure               Nodes          MCL    Optimal
Ring                    N              2      yes
Mesh                    N = m^2        4      yes
Hypercube               N = 2^n        n      yes
Cube-Connected Cycle    N = 2^n        3      yes
Complete Binary Tree    N = 2^n - 1    4      ?

Table 1. Embedding Regular Structures in TDM-MINs

Given a CR graph containing a set of edges $E$ and a CC configuration $[M_1, M_2, \ldots, M_N]$, an MC
configuration that establishes the edges in the CR graph can be found by using the selection algorithm
below. We use the symbol [ ] to denote an empty MC configuration with no mappings and the operation
|| to denote the addition of a mapping to an MC configuration.

Selection Algorithm

1. Set initially MC = [ ]

2. For each edge $p_{i \to j} \in E$ repeat

2.1. Determine the mapping $M_k$ such that $p_{i \to j} \in M_k$

2.2. If $M_k \notin MC$ then MC = MC || $M_k$

For example, consider the CR graph in Figure 2(b) and the CC configuration consisting of flip-k
mappings. The MC configuration selected by the algorithm is $[M_f(1), M_f(2), M_f(3)]$. Since $t = 3$ and
$N = 8$, a 3-way rather than an 8-way TDM-MIN may be used for the application to improve the
efficiency. Note that the maximum number in the set of in-degrees and out-degrees of nodes in the graph
is 2 and, thus, the configuration may not be optimal. In fact, an optimal MC configuration with $t = 2$ will
be obtained in the next section.
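The selection algorithm, specialized to a CC configuration of flip-k mappings, can be sketched in a few lines of Python. The function name `select_flip_mappings` is hypothetical, each selected mapping $M_f(k)$ is represented only by its index $k$, and the edge list assumes that the last edge of Eq. 3.1 reads (7,6).

```python
def select_flip_mappings(edges):
    """Selection algorithm over flip-k mappings.

    Returns the indices k of the selected mappings M_f(k), in the order in
    which they are first needed.  The path (i, j) lies in M_f(i xor j)."""
    selected = []                     # step 1: MC = [ ]
    for i, j in edges:                # step 2: for each edge
        k = i ^ j                     # step 2.1: mapping that holds the edge
        if k not in selected:         # step 2.2: add it if not yet in MC
            selected.append(k)
    return selected


# Edge set of Figure 2(b) as read from Eq. 3.1 (last edge assumed to be (7, 6)).
E = [(0, 1), (1, 0), (1, 3), (2, 1), (2, 3), (3, 2),
     (4, 5), (5, 4), (5, 6), (6, 7), (7, 5), (7, 6)]

print(select_flip_mappings(E))  # [1, 2, 3]
```

The list membership test keeps the sketch short; replacing `selected` with a set plus an ordered list would give the linear running time mentioned in the text for large edge sets.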

Probabilistic analysis of the average multiplexing cycle length of MC configurations for random CR
graphs can be carried out as follows. Given a CR graph in which $S$ out of $N$ source nodes are each
connected randomly to $D$ out of $N$ destination nodes, define $s = S/N$ and $d = D/N$. The probability that an
edge is established in any given mapping of a given CC configuration is $1/N$. Since edges that go out from the
same source node must be established in different mappings, exactly $D$ mappings are needed to establish
paths from a source node to $D$ destination nodes. Therefore, after selecting $D$ mappings from the given
CC configuration, the probability that any mapping has not been selected is $p = 1 - d$. Since each source
node randomly selects $D$ mappings independent of others, the probability that a mapping in the CC
configuration remains un-selected after all $S$ source nodes have selected their mappings is $P = p^S$. The
probability that exactly $i$ mappings have been selected for an MC configuration is thus

$$\mathrm{Prob}(i) = \binom{N}{i} P^{N-i} (1 - P)^{i} \qquad (3.5)$$

Therefore, the average (expected) number of mappings selected, that is, the expected multiplexing
cycle length of an MC configuration is

$$t_{av} = \sum_{i=0}^{N} i \cdot \mathrm{Prob}(i) = N (1 - P) \qquad (3.6)$$


Clearly, $t_{av} \ge D$, since $D$ is the out-degree of a source node. The percentage of communication load
of an application can be approximated by the ratio of $|E|$ versus $N^2$. In the above analysis, the load
percentage is proportional to $s \cdot d$. Figure 3 shows calculated values of $t_{av}/N$ for different system sizes $N$ under
different load conditions assuming $s = d$. It can be seen that with reasonably small system size and under
low load condition, the selection algorithm can generate an MC configuration that improves over a CC
configuration. Note that, by using the CC configuration with flip-k mappings, the selection algorithm
is simple, and so is the control of the resulting MC configuration, because individual stage control suffices. The time
complexity of the algorithm is linear in the number of connections requested, that is, $O(|E|)$.
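The analysis above can be evaluated numerically. The Python sketch below assumes that Eq. 3.5 is a binomial distribution over the $N$ mappings with per-mapping selection probability $1 - P$, and that $t_{av}$ is its mean, $N(1 - P)$; the function names are hypothetical.

```python
from math import comb


def unselected_probability(N, S, D):
    """P = (1 - D/N)^S: a given mapping is chosen by none of the S sources."""
    return (1 - D / N) ** S


def prob_selected(i, N, S, D):
    """Probability that exactly i of the N mappings are selected (binomial)."""
    P = unselected_probability(N, S, D)
    return comb(N, i) * P ** (N - i) * (1 - P) ** i


def expected_mcl(N, S, D):
    """Expected multiplexing cycle length, t_av = N * (1 - P)."""
    return N * (1 - unselected_probability(N, S, D))


N, S, D = 8, 4, 2
print(expected_mcl(N, S, D))      # 5.46875
print(expected_mcl(N, S, D) / N)  # fraction of the N mappings selected
```

With $s = d$, sweeping `N` and the load in this function gives the kind of curve that Figure 3 is described as showing.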


Figure 3. Percentage of mappings selected (horizontal axis: log N)


The problem with the selection algorithm is that it restricts itself to the set of $N$ mappings in a
chosen CC configuration. With individual switch control, however, a total of $2^{mn}$ mappings
(excluding partial mappings) are possible in the MIN. Given a CR graph with a set of edges $E$, an MC
configuration can be obtained by composing mappings based on the set $E$, which may be different from
any mapping in a chosen CC configuration. More specifically, assuming individual switch control, the
following composition algorithm composes each mapping in a greedy fashion. That is, starting with an
empty set of paths, a mapping is composed by including as many required paths as possible provided that
they do not conflict.

The Composition Algorithm

Set MC = [ ] and $k = 1$. Repeat until $E$ is empty

1. Reset mapping $M_k = \emptyset$ and all elements of $SS(M_k)$ to "x"

2. For each edge $p_{i \to j} \in E$

If $\{p_{i \to j}\}$ is compatible with $M_k$

2.1. $M_k = M_k \cup \{p_{i \to j}\}$ and update $SS(M_k)$ accordingly

2.2. delete $p_{i \to j}$ from the set $E$.

3. MC = MC || $M_k$ and $k = k + 1$

For example, an MC configuration that is composed by the algorithm for the CR graph in Figure
2(b) is $[M_1, M_2]$. For easy verification by the reader, we show each mapping in the MC
configuration with its corresponding switch setting array in Eq. 3.7.

$$M_1 = \{(0,1), (1,0), (2,3), (3,2), (4,5), (5,4), (6,7), (7,6)\}$$

$$M_2 = \{(1,3), (2,1), (5,6), (7,5)\} \qquad (3.7a)$$

Their switch setting arrays are respectively:

$$SS(M_1) = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix} \qquad SS(M_2) = \begin{bmatrix} 0 & 1 & 1 \\ \vdots & \vdots & \vdots \\ 0 & 1 & 1 \end{bmatrix} \qquad (3.7b)$$
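The greedy composition can be sketched in Python as follows. The report does not spell out the wiring of the generalized cube network in the text, so the sketch assumes that stage $s$ (counted from 1, left to right) pairs lines whose numbers differ in address bit $n - s$, and that a switch is indexed by its line number with that bit removed. The names `path_settings`, `compose` and `as_array`, and the row ordering of switches, are likewise assumptions.

```python
def path_settings(src, dst, n):
    """Switch states {(switch, stage): 0 or 1} used by the path src -> dst,
    under the assumed wiring (stage s handles address bit n - s)."""
    settings, line = {}, src
    for stage in range(1, n + 1):
        bit = n - stage
        # Switch index: current line number with address bit `bit` removed.
        switch = ((line >> (bit + 1)) << bit) | (line & ((1 << bit) - 1))
        state = ((src ^ dst) >> bit) & 1      # 1 = cross, 0 = straight
        settings[(switch, stage)] = state
        if state:
            line ^= 1 << bit                  # crossing flips this address bit
    return settings


def compose(edges, n):
    """Greedy composition: returns a list of (mapping, switch_settings)."""
    remaining, config = list(edges), []
    while remaining:
        mapping, ss, rest = [], {}, []        # all switches start as "x"
        for i, j in remaining:
            new = path_settings(i, j, n)
            if all(ss.get(key, val) == val for key, val in new.items()):
                mapping.append((i, j))        # compatible: add the path
                ss.update(new)
            else:
                rest.append((i, j))           # try again in a later mapping
        config.append((mapping, ss))
        remaining = rest
    return config


def as_array(ss, n):
    """Rows = switches, columns = stages; unset switches are shown as "x"."""
    return ["".join(str(ss.get((sw, st), "x")) for st in range(1, n + 1))
            for sw in range(2 ** (n - 1))]


# Edge set of Figure 2(b) as read from Eq. 3.1 (last edge assumed to be (7, 6)).
E = [(0, 1), (1, 0), (1, 3), (2, 1), (2, 3), (3, 2),
     (4, 5), (5, 4), (5, 6), (6, 7), (7, 5), (7, 6)]
config = compose(E, 3)
print(len(config))                # 2
print(config[1][0])               # [(1, 3), (2, 1), (5, 6), (7, 5)]
print(as_array(config[0][1], 3))  # ['001', '001', '001', '001']
```

Under the assumed wiring the sketch yields the two mappings of Eq. 3.7a. The result of a greedy pass depends on the order in which edges are scanned, so other orders may give longer configurations.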

In this example, the MC configuration composed by the algorithm is optimal with $t = 2$, which
improves over the MC configuration generated by the selection algorithm. Simulations have been done to
determine the average multiplexing cycle length of MC configurations composed by the composition
algorithm. A random number generator is used to generate $D$ distinct destination nodes for each of the $S$
source nodes. Figure 4 shows simulation results of $t_{av}$ for different system sizes under different load
conditions where $s = S/N$ is equal to $d = D/N$. It can be seen that under low or medium load conditions, the
composition algorithm improves over the selection algorithm as expected. However, when the load is
extremely high, the multiplexing cycle length of an MC configuration could exceed $N$. That is,
configurations under high load condition using this algorithm may be worse than simply using a CC
configuration. Note that $t_{av}/N$ does not vary much with the system size $N$. Figure 5 shows simulation
results for a system with $N = 32$ with various $s$ and $d$. Given that it takes $O(\log N)$ time to compute
switch settings for a path, the composition algorithm has the time complexity of $O(|E|^2 \log N)$.
Figure 4. Percentage of mappings composed


Figure 5. Different load conditions when N = 32

It is desirable not only to do better than the selection algorithm under low or medium load conditions,
but also to bound the multiplexing cycle length of any MC configuration by $N$ under high load conditions.
One way is to use the selection algorithm first to determine a set of up to $N$ flip-k mappings
needed. Then each mapping is examined to see if it can be deleted from the configuration by migrating
paths established in it to other mappings in the configuration. Given a CR graph with a set $E$, we can use
the following merge algorithm to achieve the above objective.

The Merge Algorithm

1. Run the selection algorithm to determine an MC configuration.

2. For each mapping $M_k \in MC$, repeat step 3

3. If every $p_{i \to j} \in M_k$ is such that $\{p_{i \to j}\}$ is compatible with an $M_l$ currently in MC where $l \ne k$

3.1. For every $p_{i \to j} \in M_k$

$M_l = M_l \cup \{p_{i \to j}\}$ if $\{p_{i \to j}\}$ is compatible with $M_l$

3.2. Remove $M_k$ from MC

Simulations have been done under similar assumptions to those used for the composition algorithm.
Figure 6 shows the results of the merge algorithm for a system with $N = 32$. It can be seen that the merge
algorithm performs as well as the composition algorithm under low or medium load conditions and
converges to the selection algorithm under high load conditions. The complexity of this algorithm can be
shown to be $O(N |E|^2 \log N)$.
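One possible reading of the merge algorithm is sketched below in Python. Several points are assumptions rather than statements of the report: each selected flip-k mapping is kept as the partial mapping of requested paths only, the migration of a mapping's paths is tried on copies of the other mappings and committed only if every path finds a place, and the network wiring is the same assumed one as before (stage $s$ handles address bit $n - s$). All function names are hypothetical.

```python
def path_settings(src, dst, n):
    """Switch states {(switch, stage): 0 or 1} used by the path src -> dst,
    assuming stage s (1-based) pairs lines that differ in bit n - s."""
    settings, line = {}, src
    for stage in range(1, n + 1):
        bit = n - stage
        switch = ((line >> (bit + 1)) << bit) | (line & ((1 << bit) - 1))
        state = ((src ^ dst) >> bit) & 1      # 1 = cross, 0 = straight
        settings[(switch, stage)] = state
        if state:
            line ^= 1 << bit
    return settings


def settings_of(paths, n):
    """Combined switch settings of a set of mutually compatible paths."""
    ss = {}
    for i, j in paths:
        ss.update(path_settings(i, j, n))
    return ss


def merge_config(edges, n):
    # Step 1: selection over flip-k mappings, keeping only requested paths.
    groups = {}
    for i, j in edges:
        groups.setdefault(i ^ j, []).append((i, j))
    config = list(groups.values())
    # Steps 2-3: try to dissolve each mapping into the remaining ones.
    k = 0
    while k < len(config):
        trial = [list(m) for idx, m in enumerate(config) if idx != k]
        trial_ss = [settings_of(m, n) for m in trial]
        moved_all = bool(trial)
        for path in config[k]:
            new = path_settings(path[0], path[1], n)
            for paths, ss in zip(trial, trial_ss):
                if all(ss.get(key, val) == val for key, val in new.items()):
                    paths.append(path)        # migrate the path
                    ss.update(new)
                    break
            else:
                moved_all = False             # this path fits nowhere else
                break
        if moved_all:
            config = trial                    # mapping k has been removed
        else:
            k += 1                            # keep mapping k, discard trial
    return config


# Edge set of Figure 2(b) as read from Eq. 3.1 (last edge assumed to be (7, 6)).
E = [(0, 1), (1, 0), (1, 3), (2, 1), (2, 3), (3, 2),
     (4, 5), (5, 4), (5, 6), (6, 7), (7, 5), (7, 6)]
result = merge_config(E, 3)
print(len(result))        # 2
print(sorted(result[1]))  # [(1, 3), (2, 1), (5, 6), (7, 5)]
```

Since a mapping is only ever removed, never added, the result has at most as many mappings as the selection algorithm produced, which matches the bound of $N$ that motivates the merge step.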


Figure  6.  Performance  of  the  merge  algorithm 


4.3. DYNAMIC RECONFIGURATIONS

Static reconfigurations work well if all paths are required to be established from the beginning to the
end of the execution of an application. For applications such as those in telecommunication, connection
requests are usually generated at run time. Even if a CR graph that contains all edges needed during the
execution can be constructed at compile time, it may be inefficient to perform static reconfigurations
based on such a graph, since some paths are used only for a certain duration and are wasted for the
remaining time during execution. It is possible to achieve more efficient communication in such an
application by reconfiguring a TDM-MIN dynamically based on run time requests. This means that the
mappings realized in a time slot may change over time. We call such reconfigurations dynamic
reconfigurations.

4.3.1. Centralized Reconfigurations

Run time requests may include requests for establishing new paths and releasing existing ones.
Dynamic reconfigurations can be done incrementally based on an existing configuration. If a request is to
establish a path $p_{i \to j}$, the current configuration is examined to find a mapping $M_k$ that is compatible with
$\{p_{i \to j}\}$. If successful, path $p_{i \to j}$ is added to mapping $M_k$ by updating $SS(M_k)$. In the array, only elements
that correspond to switches along the new path and have a value of "x" are updated to either "0" or "1".
Consequently, the k-th bit of the shift register of each switch whose corresponding element has
changed value may need to be updated. The source node i also sets the k-th entry in its list of output
ports to j. At this point, reconfiguration based on the request is complete.

If, however, the current configuration does not contain any mapping that is compatible with $\{p_{i \to j}\}$,
a new mapping that establishes $p_{i \to j}$ can be added to the configuration. This requires that all source
nodes be informed of the additional time slot in the multiplexing cycle. The shift registers of all switches
have to be updated accordingly. Before adding the new mapping, one can migrate existing paths in a
mapping into other mappings so that it may become compatible with $\{p_{i \to j}\}$. This way, the new mapping
may be avoided. Note that there is a tradeoff between the overhead of migrating paths and the overhead of
adding a new mapping.
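
The incremental establishment described above can be sketched as follows. This is a simplified illustration, not the report's implementation: the switch identifiers, the dictionary representation of the switch-setting array, and the function name `establish` are assumptions, and a missing dictionary key plays the role of the "x" entry. Path migration before adding a new slot is not modeled.

```python
from typing import Dict, List, Tuple

Switch = Tuple[int, int]          # (stage, index); hypothetical switch identifier
Settings = Dict[Switch, str]      # "0" or "1"; a missing key plays the role of "x"


def establish(config: List[Settings], path: Settings):
    """Add a path, given as the switch states it needs, to a configuration.

    Returns (k, changed, new_slot): the time slot used, the switches whose
    entry went from "x" to "0"/"1", and whether a new slot was appended.
    """
    for k, ss in enumerate(config):
        # Compatible if every switch on the path is free ("x") or already
        # set to the state the path requires.
        if all(ss.get(sw, "x") in ("x", state) for sw, state in path.items()):
            changed = [sw for sw in path if ss.get(sw, "x") == "x"]
            ss.update(path)
            return k, changed, False
    # No compatible mapping: lengthen the multiplexing cycle by one slot.
    config.append(dict(path))
    return len(config) - 1, list(path), True


config: List[Settings] = []
first = establish(config, {(0, 0): "0", (1, 1): "1"})   # opens slot 0
second = establish(config, {(0, 0): "0", (1, 0): "1"})  # fits into slot 0
third = establish(config, {(0, 0): "1", (1, 0): "0"})   # conflicts, opens slot 1
```

The `changed` list corresponds to the switches whose shift register bit for slot k would need to be reloaded.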

If a run time request is to release a path, the mapping that currently establishes the path may be
deleted if all remaining paths in the mapping can be migrated into other mappings in the configuration.
Such explicit release requests may not be necessary if replacement algorithms or garbage collection
algorithms are used by the central controller based on the usage of existing connections. All of these
options involve tradeoffs. Note that dynamic reconfiguration can also be done by buffering run time
requests and periodically executing a static reconfiguration algorithm. At each selected instant, a
snapshot of the CR graph is constructed based on all current paths that need to be established.

4.3.2. Distributed Reconfiguration

Assume that nodes connected to a MIN have distributed control but that global synchronization is still
applicable. In this case, the multiplexing cycle should always consist of a fixed number, k, of time slots.
This is because a node cannot learn of an increase or decrease in the number of time slots
in the multiplexing cycle (in a timely way) without being informed by a centralized control mechanism.

Each source node that wants to establish a path reserves a time slot in which the path may be
established by routing a reservation packet to the destination on the path. These reservation packets use links
and switches, called reservation links and reservation switches respectively, separate from those used by
data packets. Note that the separation can be either logical or physical. An example of logical separation
is a MIN whose links and switches are used by reservation packets and data packets in a
time-multiplexed way. In the following discussion, the terms "link" and "switch" refer to reservation
links and reservation switches respectively.

Let an input port of a switch s be denoted by $l(s)$ and let an output port of the switch be denoted by
$r(s)$. Let a path be represented by a sequence of $n = \log N$ pairs of left and right ports of switches,
one pair at each stage. That is, $p_{i \to j}$ can be represented by
$(\langle l(s_1), r(s_1) \rangle, \langle l(s_2), r(s_2) \rangle, \ldots, \langle l(s_n), r(s_n) \rangle)$.
Note that this implies that $r(s_i)$ is connected to $l(s_{i+1})$. Let every output port $r(X)$ maintain the
set of time slots that are not used by any path, and denote that set by $AVAL(r(X))$. Assume that each
source node also maintains an $AVAL(l(Y))$ list for the input port $l(Y)$ to which it is connected. Let
"lock" and "unlock" be mutually exclusive operations on a switch port. Only the reservation packet that
successfully locks a port can update its AVAL list; other reservation packets are buffered at the switch
until the port is unlocked. When the TDM-MIN system is started, all ports are unlocked and, for any port Z,
$AVAL(Z) = \{1, 2, \ldots, k\}$.

Each reservation packet maintains a set of time slots that are available for possibly establishing the
corresponding path. Denote by $AVAL(R)$ the set of available time slots maintained by a reservation
packet R. When a reservation packet R is generated, its $AVAL(R)$ is set to $AVAL(l(s_1))$. As a reservation
packet goes through each switch, it locks the corresponding port and updates its own $AVAL(R)$ to the set
of time slots that are available at all switch ports visited so far. If the reservation packet reaches the
destination, it chooses a time slot $ts \in AVAL(R)$ and returns to the source along the same path in
reverse order. As it passes each switch, it deletes ts from the AVAL list of each port visited and
unlocks the port. At the same time, the ts-th bit of the shift register of the switch is loaded with the
proper state. When the packet comes back to the source node, the source node records the destination
of the packet in the ts-th entry of its list of output ports.

If $AVAL(R)$ becomes empty at a switch before the reservation packet reaches its destination, the
packet is blocked. Two strategies similar to "holding" and "dropping" in circuit switching
can be used when a packet is blocked. If holding is used, the packet stays in the buffer of the switch.
An advantage is that whenever some paths using the same switch port are released, the reservation packet
can continue its routing without repeating the route from the source up to that switch. A disadvantage is
that switch ports that have been locked by the packet cannot be used by other reservation packets while
the packet is blocked. An alternative is to use dropping, in which the reservation packet reverses its way,
unlocking switch ports and undoing changes to the AVAL sets of switches. The source node may queue the
packet and try to send it again after a random interval. Note that a combination of these two
strategies, which drops a packet after holding it for a certain period, can also be used.

If a source node wants to release a path, it sends a cancellation packet R with $AVAL(R)$ containing
the time slot in which the path is established. The cancellation packet adds the time slot in $AVAL(R)$
to the AVAL set of every switch port visited on the way to its destination. Assuming that dropping is used,
the following algorithm may be executed in a distributed fashion when establishing and releasing a path.

The  Distributed  Algorithm  (with  dropping) 

Establish( $(\langle l(s_1), r(s_1) \rangle, \langle l(s_2), r(s_2) \rangle, \ldots, \langle l(s_n), r(s_n) \rangle)$ )

1. The source node connected to $l(s_1)$ generates a reservation packet R with

$AVAL(R) = AVAL(l(s_1))$

2. For i = 1 to n do

2.1. Lock port $r(s_i)$. $AVAL(R) = AVAL(R) \cap AVAL(r(s_i))$

2.2. If $AVAL(R) = \emptyset$ then

2.2.1. For j = i downto 1 do unlock port $r(s_j)$

2.2.2. The source node realizes the path is not established

2.2.3. Exit this procedure

3. Choose a time slot $ts \in AVAL(R)$

4. For i = n downto 1 do

4.1. $AVAL(r(s_i)) = AVAL(r(s_i)) - \{ts\}$ and unlock port $r(s_i)$.

5. $AVAL(l(s_1)) = AVAL(l(s_1)) - \{ts\}$.

Release( $(\langle l(s_1), r(s_1) \rangle, \langle l(s_2), r(s_2) \rangle, \ldots, \langle l(s_n), r(s_n) \rangle)$ )

1. The source node connected to $l(s_1)$ generates a cancellation packet R with $AVAL(R) = \{ts\}$

2. For i = 1 to n do

2.1. Lock port $r(s_i)$. $AVAL(r(s_i)) = AVAL(r(s_i)) \cup \{ts\}$. Unlock port $r(s_i)$

3. $AVAL(l(s_1)) = AVAL(l(s_1)) \cup \{ts\}$

Note that when the queue of unsent packets is not empty, a source node may periodically
execute the procedure Establish() from step 2. An algorithm with holding can be written similarly, as
can an algorithm with the combination of holding and dropping.
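
The Establish and Release procedures with dropping can be simulated sequentially as sketched below. Several things here are assumptions made for illustration: the port names, the dictionary of AVAL sets, the `locked` set, and the choice of the lowest-numbered slot in step 3 (the text only says that a slot is chosen). Because the sketch runs one packet at a time, locks are recorded but never contended, and the loading of shift registers is not modeled.

```python
from typing import Dict, Hashable, List, Optional, Set

Port = Hashable   # hypothetical port identifier


def establish(aval: Dict[Port, Set[int]], locked: Set[Port],
              source_port: Port, out_ports: List[Port]) -> Optional[int]:
    """Reserve one time slot on every output port of a path, or drop."""
    avail = set(aval[source_port])               # step 1
    for i, port in enumerate(out_ports):         # step 2
        locked.add(port)
        avail &= aval[port]
        if not avail:                            # step 2.2: drop the packet
            for p in reversed(out_ports[:i + 1]):
                locked.discard(p)
            return None
    ts = min(avail)                              # step 3 (lowest slot: an assumption)
    for port in reversed(out_ports):             # step 4
        aval[port].discard(ts)
        locked.discard(port)
    aval[source_port].discard(ts)                # step 5
    return ts


def release(aval: Dict[Port, Set[int]], locked: Set[Port],
            source_port: Port, out_ports: List[Port], ts: int) -> None:
    """Return time slot ts to every port of a previously established path."""
    for port in out_ports:
        locked.add(port)
        aval[port].add(ts)
        locked.discard(port)
    aval[source_port].add(ts)


k = 2   # fixed number of time slots in the multiplexing cycle
ports = ["in0", "in1", "in2", "a", "b", "c", "d"]
aval = {p: set(range(1, k + 1)) for p in ports}
locked: Set[Port] = set()

t1 = establish(aval, locked, "in0", ["a", "b"])   # slot 1
t2 = establish(aval, locked, "in1", ["a", "c"])   # slot 2, port "a" is shared
t3 = establish(aval, locked, "in2", ["a", "d"])   # None: port "a" is full
release(aval, locked, "in0", ["a", "b"], t1)
t4 = establish(aval, locked, "in2", ["a", "d"])   # slot 1 is free again
```

A holding variant would keep the packet at the blocking switch instead of returning `None`, at the cost of leaving the earlier ports locked.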



4.4.  SUMMARY 


To summarize, reconfiguration with TDM is a connection paradigm that can be applied to
multiprocessor systems using multistage interconnection networks. It provides more architectural flexibility
and could achieve higher communication bandwidth than conventional switching methods.
It is especially promising for optical interconnection networks for two reasons. First, high optical
communication bandwidths make time-division multiplexing feasible and more attractive. Important
properties of optical signal propagation, namely unidirectional propagation and predictable path delay,
enable pipelined transmissions over optical waveguides. The way in which switches are set in a TDM-MIN
can further simplify pipelining between stages. Second, partitioning connection requests and establishing
subsets in a time-division multiplexed way can simplify control and eliminate the need for message
relaying, optical delay loops (optical time-slot interchanging) and costly conversions between optical and
electronic signals. As a result, current photonic switching technology can be readily adopted for TDM-MIN
implementations.
Frontiers of Massively Parallel Computation, Z. Guo, R. Melhem, R. Hall, D. Chiarulli, and
• “Coincident pulse techniques for multiprocessor interconnection structures”; S.P. Levitan.
• “Optical bus control for distributed multiprocessors”; D.M. Chiarulli, S.P. Levitan, and R.G.
Melhem; Journal of Parallel and Distributed Computing, 10:45-54, 1990.
Reviewer for National Science Foundation

Textbook  review  for  Boyd  and  Frasier  Publishing 

Textbook  review  for  Prentice  Hall  Publishing 

Referee  for  Applied  Optics 

Referee  for  IEEE  Computer 

Referee for IEEE Transactions on Computers

Referee  for  Journal  of  Parallel  and  Distributed  Computing 

Referee  for  International  Symposium  on  Computer  Arithmetic 

Referee  for  International  Conference  on  Parallel  Processing 

University  Service 

Executive Committee for Academic Computing (ECAC), 1989-1991. This is the primary
faculty oversight committee for Computing and Information Services.

ECAC Mainframe Services Subcommittee, 1988-1989. Formerly the Performance and Service
Priorities Subcommittee; responsibilities include the oversight of computing
center services relating to the VAX cluster and VAX ULTRIX systems. Specific
contributions include a set of performance measures which were adopted for monthly
reporting to the committee.

ECAC Budget and Planning Subcommittee, 1989-1991. This subcommittee is responsible
for review of all ECAC budgetary matters. The committee has also periodically produced
long range planning documents on university wide computing.

EE/CS Computer Engineering Committee, 1987-1991. This committee is evaluating curriculum
and administrative issues related to introduction of a Computer Engineering
program.



Departmental  Service 

Computing Needs Committee (Chairman), 1987-1991. This committee evaluates departmental
computing resources on an annual basis to establish acquisition priorities.

Graduate Program Committee (Chairman), 1990-1991. This committee is responsible for
curriculum, new courses, and administrative matters relating to graduate studies as
well as the annual administration of the preliminary exam. A specific contribution
while chairman was a significant reorganization of the rules for the preliminary
exam.

Undergraduate Curriculum Committee, 1989-1990. As a member of this committee, I
participated in the restructuring of the undergraduate core courses for computer
science, including a conversion to lecture/recitation formats. Specific contributions
include an overhaul of the undergraduate systems course sequence.
Steven Peter Levitan

Department  of  Electrical  Engineering 
Benedum  Engineering  Hall 
University  of  Pittsburgh 
Pittsburgh, PA 15261


PROFESSIONAL  INTERESTS 

The design, modeling, and simulation of highly parallel systems, including parallel computer architectures,
parallel algorithms, and VLSI. Additional interests include design tools and methodology for software,
hardware, and VLSI.

CURRENT  POSITION 

Wellington  C.  Carl  Assistant  Professor  in  the  Department  of  Electrical  Engineering  at  the  University 
of  Pittsburgh. 

EDUCATION 

Ph.D., May 1984; University of Massachusetts, Department of Computer and Information Science
(COINS). Dissertation title: “Parallel Architectures and Algorithms: A Programmer's Perspective”.
Advisor: Caxton C. Foster.

M.S., September 1979; University of Massachusetts (COINS). Specialization: Computer Systems.

B.S., June 1972; Case Western Reserve University, School of Engineering. Major: Computer Science,
Minor: Electrical Engineering.

PROFESSIONAL  POSITIONS  HELD 

Wellington C. Carl Assistant Professor, 1987-; Department of Electrical Engineering, University of
Pittsburgh.

Assistant Professor, 1985-1986 (tenure stream began Sept., 1985); Department of Electrical and
Computer Engineering (ECE), University of Massachusetts, Amherst. Director, VLSI
Laboratory (1985-1986); Responsible for coordination and direction of VLSI design laboratory. Liaison to
the Massachusetts Technology Park Corporation, Massachusetts Microelectronics Center (MMC).

Visiting Assistant Professor, 1984-1985; Electrical and Computer Engineering, University of
Massachusetts, Amherst.

Consultant, 1984-1987; Viewlogic Systems Inc. Developed VLSI design and simulation software.

Consultant, 1982-1983; Digital Equipment Corporation. Consulted for the VLSI Advanced Architectures
Group on silicon compilers, simulation, and parallel processing issues.

Engineer, Summer 1982; DEC. One of the team which developed the “Silicon Synthesis Project”, a VLSI
design tool.

Co-founder, 1980-1983; Humanistic Computing Systems, consultants in the development of user-friendly
software.

Teaching Assistant/Lecturer, 1978-1982; Department of Computer and Information Science (COINS),
University of Massachusetts, Amherst.

Senior Systems Engineer, 1972-1977; Xylogic Systems Inc. Designed serial, parallel and DMA interfaces
for minicomputer based text processing systems used in newspaper production. I trained and supervised
in-house test and field service personnel. Largest project was a multi-computer, dual-chain disk controller.

Test Technician, 1972; ARP Inc. Tested and repaired music synthesizers, and trained repair personnel.
PUBLICATIONS

Refereed  Journal  Publications: 


1.  “SPAR:  A  Schematic  Place  and  Route  System”;  Stephen  T.  Frezza  and  Steven  P.  Levitan; 
(submitted)  IEEE  Transactions  on  Computer  Aided  Design  of  Integrated  Circuits. 

2. “A Systems Theoretic Approach to the Functional Characterization of the Hippocampal Formation”;
R.J. Sclabassi, D.N. Krieger, German Barrionuevo, S.P. Levitan, T.W. Berger; (submitted) Annals of
Biomedical Engineering, 1991.

3. “Optical Multicasting in Linear Arrays”; Chunming Qiao, R.G. Melhem, S.P. Levitan and D.M.
Chiarulli; (in press) International Journal on Optical Computing.

4. “An All Optical Addressing Circuit: Experimental Results and Scalability Analysis”; Donald M.
Chiarulli, Robert M. Ditmore, Steven P. Levitan, and Rami G. Melhem; IEEE Journal of Lightwave
Technology, Vol. 9, No. 12, pp. 1717-1725, 1991.

5. “Pipelined Communications In Optically Interconnected Arrays”; Z. Guo, R.G. Melhem, R.W. Hall,
D.M. Chiarulli, and S.P. Levitan; Journal of Parallel and Distributed Computing, Vol. 12, No. 3, pp.
269-282, 1991.

6. “An Interactive Toolset for Characterizing Complex Neural Systems”; D.N. Krieger, T.W. Berger,
S.P. Levitan, and R.J. Sclabassi; Computers and Mathematics, Vol. 20, Mathematical Models in
Medicine, No. 4-6, pp. 231-246, 1990.

7. “Coincident Pulse Techniques for Multiprocessor Interconnection Structures”; S.P. Levitan, D.M.
Chiarulli, R.G. Melhem; Applied Optics; Vol. 29, No. 14, pp. 2024-2033, May, 1990.

8. “Optical Bus Control for Distributed Multiprocessors”; D.M. Chiarulli, S.P. Levitan, R.G. Melhem;
Journal of Parallel and Distributed Computing; Vol. 10, No. 1, pp. 45-54, 1990.

9. “Space Multiplexing of Optical Waveguides in a Distributed Multiprocessor”; R.G. Melhem, D.M.
Chiarulli and S.P. Levitan; The Computer Journal, British Computer Society, Vol. 32, No. 4, pp.
362-369, 1989.

10. “The Image Understanding Architecture”; C. C. Weems, S. P. Levitan, A. R. Hanson, E. M. Riseman,
J. G. Nash, D. B. Shu; International Journal of Computer Vision, Vol. 2, pp. 251-282 (1989).

11. “Using Coincident Optical Pulses for Parallel Memory Addressing”; D. Chiarulli, R. Melhem,
S. Levitan; Computer, Vol. 20, No. 12, pp. 48-57, December, 1987.

Chapters  in  Edited  Books: 

1. “Nonlinear Systems Analysis of Network Properties of the Hippocampal Formation”; T.W. Berger,
G. Barrionuevo, S.P. Levitan, D.N. Krieger, and R.J. Sclabassi; pp. 283-352; Learning and
Computational Neuroscience: Foundations of Adaptive Networks; M. Gabriel and J. W. Moore
(Eds.), M.I.T. Press, 1990.

2. “Theoretical Decomposition of Neuronal Networks”; R.J. Sclabassi, D.N. Krieger, J. Solomon, J.
Samosky, S.P. Levitan, and T.W. Berger; (in) Advanced Methods of Physiological System Modeling,
Vol. 2, V.Z. Marmarelis (Ed.), pp. 129-146, Plenum Press, New York, 1989.

3. “Using VHDL as a Language for Synthesis of CMOS VLSI Circuits”; S.P. Levitan, A.R. Martello,
R.M. Owens, M.J. Irwin; (in) Computer Hardware Description Languages and their Applications; J.A.
Darringer and F.J. Ramming, Eds.; Elsevier, Amsterdam, 1989; pp. 331-346; IFIP WG 10.2, 9th
Intl. Symp. on Computer Hardware Description Languages; Washington D.C., June, 1989.

4. “The UMass Image Understanding Architecture”; Steven P. Levitan, Charles C. Weems, Allen R.
Hanson and Edward M. Riseman; (in) Parallel Computer Vision; Leonard Uhr (Ed.), Academic
Press, New York, 1987; pp. 215-248.

5. “Measuring Communication Structures in Parallel Architectures and Algorithms”; Steven P. Levitan;
(in) The Characteristics of Parallel Algorithms; L. Jamieson, D. Gannon, and R. Douglass (Eds.),
Cambridge, MA; MIT Press, 1987; pp. 101-137.

6. “Signal to Symbols: Unblocking the Vision Communications/Control Bottleneck”; Steven P. Levitan,
Charles C. Weems, Edward M. Riseman; (in) VLSI Signal Processing (proceedings of the 1984 IEEE
Workshop on VLSI Signal Processing at University of Southern California, Los Angeles, CA;
November 27-29, 1984); IEEE Press; New York, NY; 1984; pp. 411-420.

7. “A Content Addressable Array Parallel Processor and Some Applications”; Charles C. Weems,
Steven P. Levitan, Daryl T. Lawton, and Caxton C. Foster; (in) Image Understanding, Proceedings
of the DARPA Workshop, Arlington, Virginia; June 23, 1983; Science Applications, Inc. Report
Number SAI-84-176-WA.


Refereed  Conference  Proceedings: 


1. “Efficient Channel Allocation for Routing in Optically Interconnected Multiprocessor Systems”;
C. Qiao, R. Melhem, D. M. Chiarulli, S. P. Levitan; SPIE Symposium on OE/Aerospace
Sensing'92, Conf. on Advances in Optical Information Processing V; Orlando, FL; 1704-25;
April 20-24, 1992.

2. “Temporal Specification Verification via Causal Reasoning”; A. R. Martello, S. P. Levitan; Tau'92:
ACM International Workshop on Timing Issues in the Specification and Synthesis of
Digital Systems; Princeton, New Jersey; March 18-20, 1992.

3. “Architectural Synthesis via VHDL”; S. P. Levitan, B. Pangrle, Y.W. Hsieh; Third Physical Design
Workshop; Nemacolin Woodlands, PA, May 20-23, 1991.

4. “Multicasting in optical bus connected processors using coincident pulse techniques”; Chunming
Qiao, R. Melhem, D. Chiarulli and S. Levitan; International Conference on Parallel
Processing; (poster); St. Charles, IL, August 20-23, 1991.

5. “Demonstration of an All Optical Addressing Circuit”; D. Chiarulli, S. Levitan, R. Melhem; Optical
Society of America Topical Meeting on Optical Computing; Technical Digest Vol. 6, TuC3-1,
pp. 235-238; Salt Lake City, UT, March 4-6, 1991.

6. “Self Routing Interconnection Structures Using Coincident Pulse Techniques”; D. Chiarulli, S.
Levitan, R. Melhem; SPIE OE/Boston'90, 1390-25, S4, pp. 403-414; November 4-9, 1990.

7. “Pipelined Communications on Optical Busses”; Z. Guo, R. Melhem, R. Hall, D. Chiarulli, S.
Levitan; SPIE OE/Boston'90, 1390-26, S4, pp. 415-426; November 4-9, 1990.

8. “The Identification of Hippocampal Network Function”; R.J. Sclabassi, D.N. Krieger, G. Barrionuevo,
S.P. Levitan, and T.W. Berger; Proceedings of the IEEE Annual International Conference
of Engineering in Medicine and Biology Society; Vol. 12, No. 4, pp. 1886-1888; Philadelphia,
October, 1990.

9. “A Fault Tolerant Design of the Generalized Cube Network”; T.D. Han, D.A. Carlson, and S.P.
Levitan; Proceedings of the ISMM International Conference on Parallel and Distributed
Computing, and Systems; pp. 160-165; October 10-12, New York; R.A. Ammar, Editor; Acta
Press, 1990.

10. “Array Processors with Pipelined Optical Busses”; Z. Guo, R. Melhem, R. Hall, D. Chiarulli, S.
Levitan; IEEE Frontiers'90: 3rd Symposium on the Frontiers of Massively Parallel
Computation; pp. 333-342; University of Maryland, College Park, MD, October 8-10, 1990.

11. “Causal Timing Verification”; A. R. Martello, S. P. Levitan; Tau'90: ACM International
Workshop on Timing Issues in the Specification and Synthesis of Digital Systems;
Vancouver, BC; August 15-17, 1990.
12. “Timing Verification Using HDTV”; A. R. Martello, S. P. Levitan, D. M. Chiarulli; Proceedings of
the 27th Design Automation Conference, pp. 118-123, June 1990.

13. “Modeling of Neuronal Networks Through Decomposition”; R.J. Sclabassi, J. Samosky, D.N. Krieger,
J. Solomon, S.P. Levitan, and T.W. Berger; International Joint Conference on Neural
Networks; pp. 1773-1780; Washington, D.C., June 1989.

14. “A VLSI CAD System for VHDL”; S. P. Levitan, R. M. Owens, M. J. Irwin; Colorado
Microelectronics Conference; pp. 1-8; Antlers Hotel, Colorado Springs, CO, March 30-31, 1989.

15. “An Input/Output Model of the Hippocampal Formation”; Robert J. Sclabassi, Don Krieger,
Jacqueline Solomon, Steven P. Levitan, German Barrionuevo, Theodore Berger; Society for
Neuroscience Abstracts; 14th Annual Meeting of the Society for Neuroscience; Vol. 14, p. 247;
Toronto, November 13-18th 1988.

16. “An External Network Model of the Hippocampal Formation”; Robert J. Sclabassi, Don Krieger,
Jacqueline Solomon, Steven P. Levitan, German Barrionuevo, Theodore Berger; International
Neural Network Society Abstracts, Neural Networks (Supplement) Conference Proceedings;
Boston, MA; September, 1988; vol. 1, p. 273.

17. “A Neurophysiologic Neural Network Model”; Don Krieger, Jacqueline Solomon, Steven Levitan,
Theodore Berger, German Barrionuevo, Robert Sclabassi; 19th Annual Pittsburgh Conference
on Modeling and Simulation; May 5-6, 1988; vol. 19, pp. 2397-2401.

18. “An Easily Reconfigurable, Circuit Switched Connection Network”; Deepak Rana, Charles C. Weems,
and Steven P. Levitan; IEEE International Symposium on Circuits and Systems; Helsinki
University of Technology; Espoo, Finland; June 7-9, 1988.

19. “Teaching Computer Architecture as Engineering Design with VLSI”; S. P. Levitan and J. T. Cain;
21st Annual Hawaiian International Conference on Systems Sciences (HICSS); pp. 85-90,
Kona, HI, January 5-8, 1988.

20. “VLSI Design of High-Speed, Low-Area Addition Circuitry”; Tack-Don Han, David A. Carlson and
Steven P. Levitan; IEEE Intl. Conference on Computer Design (ICCD); pp. 418-422; Port
Chester, NY, October 5-8, 1987.

21. “The Image Understanding Architecture”; Charles C. Weems, Steven P. Levitan, Allen R. Hanson
and Edward M. Riseman; Proceedings of the DARPA Image Understanding Workshop; pp.
483-496; Los Angeles, CA; February, 1987.

22. “A Testable, Asynchronous Systolic Array Implementation of an IIR Filter”; Deepak Rana, Steven P.
Levitan, David A. Carlson and Charles E. Hutchinson; Custom Integrated Circuits Conf.; pp.
90-93; Rochester, NY, May 12-15 1986.

23. “Evaluation Criteria for Communication Structures in Parallel Architectures”; Steven P. Levitan;
1985 International Conference on Parallel Processing; St. Charles, Ill., August 20-23, 1985.

24. “Iconic to Symbolic Processing Using a Content Addressable Array Parallel Processor”; D. Lawton,
S. Levitan, C. Weems, E. Riseman, A. Hanson, M. Callahan; Proceedings, SPIE Int. Soc. Opt.
Engr., Vol. 504, pp. 92-111 (1984) (Applications of Digital Image Processing VII, San Diego, CA,
August 12-24, 1984).

25. “Iconic and Symbolic Processing Using a Content Addressable Array Parallel Processor”; Weems,
D. Lawton, S. Levitan, E. Riseman, A. Hanson; Proceedings of the IEEE Computer Society
Conf. on Computer Vision and Pattern Recognition; pp. 598-607; San Francisco, CA; June
19-29, 1985.

26. “Parallel Processing of Iconic to Symbolic Transformation of Images”; D. I. Moldovan, C. I. Wu, J.
G. Nash, S. P. Levitan, C. Weems; Proceedings of the IEEE Computer Society Conf. on
Computer Vision and Pattern Recognition; pp. 257-264; San Francisco, CA, June 19-29, 1985.
27. “Iconic to Symbolic Processing Using the Content Addressable Array Parallel Processor”; Daryl T.
Lawton, Steven P. Levitan, Charles C. Weems, Edward M. Riseman, and Allen R. Hanson;
Proceedings of the 1984 Fall Image Understanding Workshop; New Orleans, LA, October,
1984.

28. “Development and Construction of a Content Addressable Array Parallel Processor for
Knowledge-Based Image Interpretation”; Charles C. Weems, Steven P. Levitan, Caxton C. Foster,
Edward M. Riseman, Daryl T. Lawton, and Allen R. Hanson; Workshop on Algorithm-Guided
Parallel Architectures for Automatic Target Recognition; Leesburg, VA; July 16-18, 1984.

29. “Titanic: A VLSI Based Content Addressable Parallel Array Processor”; Charles C. Weems, Steven
P. Levitan, and Caxton C. Foster; International Conference on Computer Circuits, New York,
NY, September 29 - October 1, 1982.

30. “Algorithms for a Broadcast Protocol Multiprocessor”; Steven P. Levitan; 3rd International
Conference on Distributed Computing Systems, Miami/Ft. Lauderdale, FL, October 18-22,
1982.

31. “Finding an Extremum in a Network”; Steven P. Levitan and Caxton C. Foster; 9th Annual
International Symposium on Computer Architecture, Austin, TX, April 26-29, 1982.

32. “Real-Time LISP Using Content Addressable Memory”; Jeffrey G. Bonar and Steven P. Levitan;
1981 International Conference on Parallel Processing, Bellaire, MI, August 25-28, 1981.

Technical  Reports  and  Popular  Journals: 


1. “Fifth Semi-Annual Keystone Research Group Meeting May 3, 1991”; M. J. Irwin, R. M. Owens,
B. M. Pangrle, S. P. Levitan, D. M. Chiarulli, and D. E. Setliff; Department of Computer Science,
The Pennsylvania State University, CS-91-13, June, 1991.

2. “A VHDL Design Environment”; A.R. Martello and S.P. Levitan; SIGDA Newsletter, Vol. 20, No. 3,
pp.  52-67;  December,  1990. 

3. “Fourth Semi-Annual Keystone Research Group Meeting November 19, 1990”; S. P. Levitan,
D. M. Chiarulli, D. E. Setliff, M. J. Irwin, R. M. Owens, and B. M. Pangrle; Department of Electrical
Engineering, University of Pittsburgh, TR-CE-90-002.

4. “The Keystone Design Environment: Philosophy and Methodology”; S.P. Levitan, D.E. Setliff, D.M.
Chiarulli, M.J. Irwin, R.M. Owens, and B. Pangrle; TR-CE-91-001, Electrical Engineering, University
of Pittsburgh, November, 1990.

5. “Selected Topics in Architecture, Logic and Physical Synthesis”; M.J. Irwin, R.M. Owens, and S.P.
Levitan; TR-CE-88-004, Electrical Engineering, University of Pittsburgh, November, 1988.

6. “Seed: An Icon Based Schematic Editor”; John P. Elliott and Steven P. Levitan; PICA Laboratory
Technical Report TR-CE-88-001, Electrical Engineering, University of Pittsburgh, July, 1988.

7. “Asynchronous Control of Optical Busses in Closely Coupled Distributed Systems”; Donald M.
Chiarulli, Rami Melhem, Steven P. Levitan; Technical Report 88-2, Department of Computer
Science, University of Pittsburgh, 1988.

8. “The Image Understanding Architecture”; C. C. Weems, S. P. Levitan, A. R. Hanson, E. M. Riseman,
J. G. Nash, D. B. Shu; COINS Technical Report 87-76; University of Massachusetts at Amherst; 1987.

9. Parallel Algorithms and Architectures: A Programmer's Perspective; Steven P. Levitan; COINS
Technical Report 84-11; University of Massachusetts at Amherst; May 1984.

10. "APF^-L-bSP" (product review); Jeffrey G. Bonar and Steven P. Levitan, /J June 1982.

11. “Three Microcomputer LISPs” (product review); Steven P. Levitan and Jeffrey G. Bonar, ID'TT,
August 1981.

12. “The Super-Kim Project: A 6502 Microcomputer System for the Real-Time Laboratory”; Steven P.
Levitan,  COINS  Technical  Report,  July  1979. 

13.  “CAMEOS:  a  Content  Addressable  Memory  Enhanced  Operating  System";  Steven  P.  Levitan  and 
Caxton  C.  Foster,  COINS  Technical  Report.  March  1978. 

Invited  Presentations  and  Workshops: 


1.  “Optical  MIMD  Architectures”;  AFOSR  Workshop  on  Reconfigurable  Optical  Interconnects; 
Boulder,  CO,  March,  1992. 

2. “Panel on the Future of Optics in Computing”; Supercomputing ’91, Albuquerque, NM,
November, 1991 (Panel Chair).

3. “Keystone: A VHDL Simulation and Synthesis Environment for VLSI Design”; IBM Thomas J.
Watson Research Center, Hawthorne, NY, October, 1991.

4. “Optical Interconnection Structures For Multiprocessor Applications”; University of Pittsburgh,
Department  of  Electrical  Engineering,  January,  1991. 

5. “Using Keystone for Verification and Synthesis”; Viewlogic Systems, Marlboro, MA, September,
1990.

6.  “Timing  Verification  of  Digital  Interfaces”;  Carnegie  Mellon  University,  Department  of  Electrical 
and  Computer  Engineering,  May,  1990. 

7. “Optical Parallel Processing”; Workshop on Optical Neural Networks, Jackson Hole, WY,
February,  1990. 

8.  “Addressing  and  Control  in  Optical  Interconnection  Networks  for  Hybrid  Multiprocessors"; 
University  of  Colorado  at  Boulder,  Optoelectronic  Computing  Systems  Center,  Boulder,  CO, 
February,  1990. 

9. “The Keystone Silicon Synthesis Project”; Viewlogic Systems, Marlboro, MA, January, 1990.

10. “Silicon Synthesis: A VHDL Approach”; IEEE Student Chapter, Pennsylvania State University,
November,  1989. 

11. “Experiences Using VHDL in the Classroom”; VHDL Users Group Meeting, Sheraton Hotel,
Redondo Beach, CA, October, 1989.

12. “Synthesis of CMOS Structures from VHDL”; VHDL Methods Workshop, University of Virginia,
Charlottesville,  VA;  August,  1989. 

13. “VLSI Curriculum: CAD for VLSI”; VLSI Education Conference & Exposition, Santa Clara,
CA, July, 1989, pp. 181-182.

14. “From VHDL to Layout”; VHDL Users Group Meeting, Sheraton Hotel, Redondo Beach,
October, 1988.

15. “An Integrated Capture and Simulation Tool for Digital Designs”; Penn State University,
September, 1988.

16. “Architectures and VLSI”; Workshop on Special Computer Architectures for Robotics,
International Conference on Robotics and Automation, Philadelphia, April, 1988.

17. NSF/MOSIS Undergraduate Education Workshop, NSF, November, 1987.

18. “Parallel Algorithms and Architectures: A Programmer's Perspective”; Taxonomy of Parallel
Algorithms Workshop, Los Alamos National Laboratory, Santa Fe, New Mexico, November, 1983.

19. “Topics In Computer Architecture”; Smith College, February, 1983.

Patents

An Optical Selector Switch (with R. Melhem and D. Chiarulli), Approved September 1988, Number
4,883,334.

GRANTS 

Current: 

National Science Foundation, 5/92-5/95, “A Research Experiences for Undergraduates Site: Training
Students to Model Polymer Behavior Through Computer Simulations”; $150,000 (CI) with A.C.
Balazs (PI), and R.L. Pinkus.

National  Science  Foundation,  1/92-12/94,  "Temporal  Specification  Verification”;  $218,292  (PI). 

Guidance  Technologies,  1/92-1/93,  “Unrestricted  Gift”;  $17,929. 

Association for Computing Machinery - SIGDA, 12/92-6/93, ACM/SIGDA “Creation of a SIGDA
Internet Server”; $23,834 (PI).

National Science Foundation, 7/91-7/93, “Distribution of VLSI Design Software for Education and
Research”; $97,618 (PI) MIP-9101656.

Association for Computing Machinery - SIGDA, 11/90-6/93, ACM/SIGDA “Equipment Support
for Design Automation University Booth 1991-1993”; $20,000 (PI).

National Institute of Mental Health (ADAMHA), 4/90-3/93, “Contribution of PCP and NMDA
Receptors to Network Properties of the Hippocampal Formation”; $600,000 ($44,915 to Electrical
Engineering to date) (CI) with T. W. Berger (PI), R. J. Sclabassi (CI), G. Barrionuevo, D. N. Krieger.
Program I, part of $2,270,700 Behavioral Neuroscience and Schizophrenia grant under Edward M.
Stricker (PI); MH45156-01A1.

Air Force Office of Scientific Research, 7/89-7/92, “Coincident Pulse Techniques for Hybrid
Electronic/Optical Computer Systems”; $479,511 ($108,562 to Electrical Engineering to date)
(CO-PI) with D.M. Chiarulli, R. Melhem; AFOSR-89-0469.

National  Science  Foundation,  10/87-6/89  "Application  to  Use  DARPA/NSF  Service  (MOSIS)  for 
Fabrication  of  Prototype  Quantities  of  Custom  Integrated  Circuits  to  Support  Education”;  $15,200. 
Renewed  6/89-9/90  $14,900,  Renewed  9/90-9/91  $5,940,  Additional  funding  1/91  $6,000,  Renewed 
10/91-9/92  $6,525  (PI). 

Office of Naval Research, 6/87-5/90, “Changes in Neuronal Network Properties Induced by Learning
and Synaptic Plasticity: A Nonlinear Systems Approach”; $394,591 ((CI) with T.W. Berger (PI), R.J.
Sclabassi, G. Barrionuevo, D.N. Krieger). Supplement: 6/90-5/91 $137,352; ($8,741 to Electrical
Engineering to date) N00014-87-K-0472.


Completed: 

Air Force Office of Scientific Research, 10/88-9/91, “A System Theoretic Investigation of Neuronal
Network Properties of the Hippocampal Formation”; $476,681 ($46,764 to Electrical Engineering to
date) (CI) with T.W. Berger (PI), R.J. Sclabassi, G. Barrionuevo, D.N. Krieger; AFOSR-890197.

National Science Foundation, 5/91-7/91, “A Research Experiences for Undergraduates Site:
Training Students to Use Computer Simulations as Research Tools”; $42,000 (CI) with A.C. Balazs
(PI), and J.F. Patzer.

Viewlogic Systems, Inc., 6/91, “Software grant: Workview 750 system with 7400 simulation
models”; $13,000 (value) (PI).

The Ben Franklin Technology Center of Western Pennsylvania, 1/91-9/91, “Unix Graphic User
Interface Development System”; $117,731 ($31,632 to Electrical Engineering) (CO-PI) with J. G.
Bonar, Guidance Technologies.

National Science Foundation, 5/88-11/90, “Instrumentation and Laboratory Improvement: Real
Time Signal Processing Laboratory for Undergraduate Instruction”; $68,435 (CI) with L.F. Chaparro
(PI), E.W. Kamen, S. Park; U XG-8852496.

National Science Foundation, 5/89-4/90, “Optical Technology for Network Based Multiprocessors”;
$49,983  (CO-PI)  with  D.  Chiarulli,  R.  Melhem;  MIP-8901053. 

Air  Force  Office  of  Scientific  Research,  7/88-7/89,  “Parallel  Memory  Using  Coincident  Optical 
Pulses”; $50,132 (CO-PI) with D. Chiarulli, R. Melhem; AFOSR-88-0198.

National  Science  Foundation,  1/88-1/89,  “CISE  Instrumentation:  A  VLSI  Design  and  Test  Facility 
for the University of Pittsburgh”; $65,597 (CO-PI) with D. Chiarulli; CCR-8716980.

DARPA/University of Massachusetts, Subcontract, 6/87-10/87, “Array Control Unit for the
UMass Image Understanding Architecture”; $28,069; 10/87-12/88 additional funds $14,715;
8/88-12/88  additional  funds  $8,922  (PI). 

Central Research Development Fund, University of Pittsburgh, 7/87-7/88, “An Integrated Tool
Set  for  Digital  Systems  Design";  $9,900  (PI). 

Advanced Research Projects Agency/Army, 9/86-12/88, “Image Understanding Architecture”;
$1,752,200 (CI) with A. Hanson, E. Riseman, C. Weems; DACA76-86-C0015.

Advanced Research Projects Agency/AFOSR, 2/86-2/88, “Intermediate Level Computer Vision
Processing Algorithm Development For Content Addressable Array Parallel Processor”; $197,000;
(CI) with A. Hanson, E. Riseman, C. Weems; F49620-86-C-0041.

Naval Research Laboratory, 3/85-10/85, “Parallel Algorithms for Low, Intermediate, and High Level
Image Understanding Tasks Using the Content Addressable Array Parallel Processor (CAAPP)”;
$24,500; (CI) with A. Hanson, E. Riseman, C. Weems; N00014-85-K-2008.

National Science Foundation, 6/85-6/86, “Computer Research Equipment (Infrastructure)”; $105,413;
(CI) with D. Carlson (PI), W. Kohler, M. Krishna, D. Pradhan, A. Singh, J. Stankovic, D. Towsley;
DCR-8505499.

Pending: 

DARPA/ISTO/Pennsylvania State University, 6/90-5/93, “Performance Driven, Multi-Level
Synthesis Tools and Accompanying Design Environments”; $1,900,757; (CO-PI) with D.E. Setliff
and D.M. Chiarulli, Department of Computer Science, and M.J. Irwin, R.M. Owens and B. Pangrle,
Penn State University.


Vita  of  Rami  G.  Melhem 
Department  of  Computer  Science 
The  University  of  Pittsburgh 
Pittsburgh,  PA  15260. 

(412)-624-8426, melhem@cs.pitt.edu


GENERAL  INFORMATION: 


Birthdate:       June 30, 1954
Birthplace:      Cairo, Egypt
Marital Status:  Married, two children
Languages:       Arabic and French


EDUCATION: 


1983  Ph.D.  Computer Science, University of Pittsburgh
1981  M.S.   Computer Science, University of Pittsburgh
1981  M.A.   Mathematics, University of Pittsburgh
1978  B.S.   Mathematics, Ein-Shams University, Cairo, Egypt
1976  B.S.   Electrical Engineering, Cairo University, Egypt


PROFESSIONAL  ACTIVITIES: 


Member in     the IEEE Computer Society
              the Association for Computing Machinery
              The International Society for Optical Engineering - SPIE

Referee for   IEEE Computer Magazine - IEEE Trans. on Computers
              IEEE Trans. on Automatic Control - SIAM Journal on Computing
              IEEE Trans. on Parallel and Distributed Computing - Distributed Computing
              Parallel Computing - J. of Parallel and Distributed Computing
              Journal of Computer and System Sciences - Computers and Structures
              The International Journal of Parallel Programming
              The International J. of Computer Simulation - J. of VLSI Signal Processing
              The International Journal of Supercomputer Applications
              Numerous conferences and symposia

Reviewer for  The National Science Foundation

Organizer     Symposium on PCCG methods and Supercomputing - Pittsburgh, PA - 1989.

Member        Program committee - Int. Conf. on Application Specific Array Processors - 1991.
              Program committee - ISMM Conf. on Parallel and Distributed Comp. & Sys. - 1991.
              Program committee - Int. Workshop on Defect and Fault Tolerance in VLSI - 1992.

Chairman      Program committee - ISMM Conf. on Parallel and Distributed Comp. & Sys. - 1992.

Guest editor  J. of Parallel and Distributed Computing - Special issue on Optical Computing - 1993.

PROFESSIONAL EXPERIENCE:


1984       Research Associate, University of Pittsburgh (January-September)

1984-1987  Assistant Professor of Computer Science, Purdue University
           (on leave from Sept. 1985 to Sept. 1987)

1985-1986  Visiting Assistant Professor of Mathematics, University of Pittsburgh
           (September-August)

1986-1989  Assistant Professor of Computer Science, University of Pittsburgh
           (September-August)

1989-      Associate Professor of Computer Science, University of Pittsburgh
           (September-)


GRANT AWARDS:

ONR:  "Application  of  Computational  Networks  and  Systolic  Arrays  to  Scientific  Computation". 

With W. C. Rheinboldt. Total Award: $233,402.00. June 1985 to September 1988.

AFOSR:  Investigator,  (C.  Hall  and  T.  Porsching,  principal  Investigators); 

"Computational  Fluid  Dynamics  at  the  Institute  for  Comp.  Math.  &  Applications" 

Total Award: $587,858.00. June 1984 to June 1987.

AFOSR:  "Parallel  Memory  Addressing  Using  Coincident  Optical  Pulses". 

With  D.  Chiarulli  and  S.  Levitan.  Total  Award:  $50,132.00.  July  1988  to  July  1989. 

NSF:  "Optical  Technology  in  Network  Based  Multiprocessors". 

With  D.  Chiarulli  and  S.  Levitan.  Total  Award:  $49,983.00.  July  1989  to  June  1990. 

AFOSR:  "Coincident  Pulse  Techniques  for  Hybrid  Optical-Electronic  Computer  Systems". 

With  D.  Chiarulli  and  S.  Levitan.  Total  Award:  $479,511.00.  August  1989  to  July  1992. 

NSF:  "Bi-level  Reconfigurations  of  Fault  Tolerant  Arrays  in  Bi-modal  Environments". 

Total Award: $61,547.00. September 1989 to August 1991.

NSF: "CISE Research Instrumentation grant for the acquisition of an Intel Hypercube".

With  M.L.  Soffa  and  T.  Znati.  Total  Award:  $124,300.00.  March  1990  to  March  1991. 


CURRENT  RESEARCH  INTERESTS: 

Parallel  and  distributed  computing  -  Fault  tolerance  in  large  computational  networks 
Optical  Computing  -  Special  Purpose  Architectures  -  Scientific  Computing 


PATENTS: 

"An  Optical  Selector  Switch",  Co-inventors;  D.  Chiarulli  and  S.  Levitan. 
Patent number 4,883,334 - November 28, 1989.

PUBLICATIONS IN ARCHIVED JOURNALS

1)  R.  G.  Melhem  and  W.  C.  Rheinboldt,  "A  Comparison  of  Methods  for  Determining  Turning  Points  of 
Non-linear  Equations",  Computing,  vol.  29,  pp.201-226,  (1982). 

2) R. G. Melhem and W. C. Rheinboldt, "A Mathematical Model for the Verification of Systolic
Networks", SIAM Journal on Computing, vol. 13, no. 3, pp. 541-565, (1984).

3) R. G. Melhem, "On the Design of a Pipelined/Systolic Finite Element System", Computers and
Structures, vol. 20, pp. 67-75, (1985).

4) R. G. Melhem, "Formal Analysis of a Systolic System for Finite Element Stiffness Matrices", Journal
of Computer and System Sciences, vol. 31, no. 1, pp. 1-27, (1985).

5) R. G. Melhem, "A Study of Data Interlock in Computational Networks for Sparse Matrix
Multiplication", IEEE Transactions on Computers, vol 36, no 9, pp. 1101-1107, (1987).

6) R. G. Melhem, "Parallel Gauss/Jordan Elimination for the Solution of Dense Linear Systems",
Parallel Computing, vol 4, no 3, pp. 339-343, (1987).

7) R. G. Melhem, "Determination of Stripe Structures for Finite Element Matrices", SIAM Journal on
Numerical Analysis, vol 24, no 6, pp. 1419-1433, (1987).

8)  R.  G.  Melhem,  "Toward  Efficient  Implementations  of  PCCG  Methods  on  Vector  Supercomputers". 
The  International  Journal  on  Supercomputer  Applications,  vol  1,  no  1,  pp.71-98,  (1987). 

9)  D.  Chiarulli,  R.  Melhem  and  S.  Levitan,  "Using  Coincident  Optical  Pulses  for  Parallel  Memory 
Addressing",  IEEE  Computer,  vol  20,  no  12,  pp.48-58,  (1987). 

10) R. G. Melhem, "Verification of a Class of Self-timed Computational Networks", BIT, vol 27, no 4
(1987), pp. 480-500.

11) R. G. Melhem, "Parallel Solution of Linear Systems with Striped, Sparse Matrices", Parallel
Computing, vol 6, no 2, pp. 165-184, (1988).

12)  R.  G.  Melhem,  "A  Modified  Frontal  Technique  Suitable  for  Parallel  Systems".  SIAM  J.  on  Scientific 
and  Statistical  Computing,  vol  9,  no  2  (1988),  pp.  289-303. 

13) K. Ramarao, R. Daley and R. Melhem, "Message Complexity of the Set Intersection Problem",
Information Processing Letters, vol 27, no 4, pp. 169-174 (1988).

14) R. Melhem and K. Ramarao, "Multicolor Ordering of Sparse Matrices Resulting from Irregular
Grids", ACM Trans. on Mathematical Software, vol 14, no 2, pp. 117-138 (1988).

15) R. Melhem and C. Guerra, "The Application of a Sequence Notation to the Design of Systolic
Computations", BIT, vol 29, no 3, pp. 409-427 (1989).

16) R. Melhem, "A Systolic Accelerator for the Iterative Solution of Sparse Linear Systems", IEEE Trans.
on Computers, vol 38, no 11, pp. 1591-1595 (1989).

17) R. Melhem, D. Chiarulli and S. Levitan, "Space Multiplexing of Waveguides in Optically
Interconnected Multiprocessor Systems", The Computer Journal, vol 32, no 4, pp. 362-369 (1989).

18) C. Guerra and R. Melhem, "Synthesis of Systolic Algorithm Designs", Parallel Computing, vol 12,
no. 2, pp. 195-207 (1989).

19) S.P. Levitan, D.M. Chiarulli and R.G. Melhem, "Coincident Pulse Techniques for Multiprocessor
Interconnection Structures", Applied Optics, vol 29, no. 14, pp. 2024-2033, (1990).

20) Y. Pan and R. Melhem, "Short Circuits in Buffered Multi-stage Interconnection Networks", The
Computer Journal, vol 33, no. 4, pp. 323-329 (1990).

21) R. Melhem and G. Hwang, "Embedding Rectangular Grids into Square Grids with Dilation Two",
IEEE Transactions on Computers, vol. 39, no. 12, pp. 1446-1455, (1990).

22) D. Chiarulli, S. Levitan and R. Melhem, "Optical Bus Control for Distributed Multiprocessors", The
Journal of Parallel and Distributed Computing, vol. 10, no. 1, pp. 45-54 (1990).

23) M. Alam and R. Melhem, "An Efficient Spare Allocation Scheme and its Application to Fault
Tolerant Binary Hypercubes", IEEE Trans. on Parallel and Distributed Systems, vol 2, no 1, pp.
117-126 (1991).

24) Z. Guo, R. Melhem, R. Hall, S. Levitan and D. Chiarulli, "Pipelined Communication in Optically
Interconnected Arrays", Journal of Parallel and Distributed Computing, vol 12, no 3, pp. 269-282,
(1991).

25) D. Chiarulli, R. Ditmore, S. Levitan and R. Melhem, "An All Optical Addressing Circuit: Experimental
Results and Scalability Analysis", IEEE J. of Lightwave Technology, vol. 9, no. 12, pp. 1717-1725,
(1991).

26) F. Provost and R. Melhem, "A Distributed Algorithm for Embedding Trees in Hypercubes with
Modification for Run-time Fault Tolerance", Accepted for publication in the Journal of Parallel and
Distributed Computing.

27) C. Qiao, R. Melhem, S. Levitan and D. Chiarulli, "Optical Multicasting in Linear Arrays", Accepted
for publication in the International Journal of Optical Computing.

28) R. Melhem, "Bilevel Reconfigurations of Fault Tolerant Arrays", Accepted for publication in IEEE
Trans. on Computers.


PAPERS IN REFEREED CONFERENCE PROCEEDINGS: (an asterisk indicates that the paper is a
preliminary version of one of the above journal papers)

1) R. G. Melhem, "A Language for the Simulation of Systolic Architectures", Proc. of The 12th. Inter-

Second Workshop on Systolic Arrays, Oxford, U.K. (1986). Also appeared in "Systolic Arrays", W.
Moore, A. McCabe and R. Urquhart editors, Adam-Hilger, 1987.

Proc. of the Second Int. Conf. on Supercomputers, (1987).

7) * R. G. Melhem, "Iterative Solution of Sparse Linear Systems on Systolic Arrays", Proc. of the Interna-

8) R. G. Melhem, "Mapping Algorithms into Architectures", Proc. of the Twenty-First Annual Hawaii
International  Conference  on  System  Sciences,  Vol  I  (1988). 

9) F. Provost and R. Melhem, "Fault Tolerant Embedding of Binary Trees and Rings into Hypercubes",
Proc. of the International Workshop on Defect and Fault Tolerance in VLSI Systems (1988).

Proc. of the 26th Allerton Conf. on Computer, Control and Communication (1988).

11) M. Alam and R. Melhem, "Fault Tolerance and Reliable Routing in Augmented Hypercube
Architectures", Proc. of the 8th. IEEE Phoenix Conference on Computers and Communications (1989).

Shared Memory Multiprocessor Systems", Proc. of the 22nd Annual Simulation Symposium, (1989).

Pulse Techniques", Proc. of the SPIE International Symposium on Advances in Interconnections and
Packaging, (1990).

21) * D. Chiarulli, S. Levitan and R. Melhem, "Demonstration of an All Optical Addressing Circuit",

of the International Conference on Parallel Processing, (1991).

24) * C. Qiao, R. Melhem, S. Levitan and D. Chiarulli, "Multicasting in Optical Bus Connected Processors
Using Coincident Pulse Techniques", Proc. of the International Conference on Parallel Processing,
(1991).

25)  T.  Znati,  K.  Pruhs  and  R.  Melhem,  "Dilation  Based  Bidding  Schemes  for  Dynamic  Load  Balancing 
on  Distributed  Processing  Systems",  Proc.  of  the  sixth  Distributed  Memory  Computing  Conference, 
(1991). 

26) R. Melhem, K. Pruhs and T. Znati, "Using Spanning Trees for Balancing Dynamic Load on
Multiprocessors", Proc. of the sixth Distributed Memory Computing Conference, (1991).

27) R. Melhem and John Ramirez, "Meshes with Flexible Redundancy", Proc. of the second Workshop
on Algorithms and Parallel VLSI Architectures, Bonas, France, (1991).

28)  F.  Provost  and  R.  Melhem.  "Embedding  Rings  in  Hypercubes  for  Run-time  Fault  Tolerance",  Proc.  of 
the  fourth  ISMM  Conference  on  Parallel  and  Distributed  Computing  and  Systems.  (1991). 

29) A. Varvitsiotis, S. Theodoridis and R. Melhem, "Mapping FIR Filtering on Systolic Rings", Proc. of
the  International  Conf.  on  Application  Specific  Array  Processors,  September  (1991). 

30) N. Shrivastava and R. Melhem, "Efficient and Optimal Fault-to-Spare Assignment in Doubly Fault
Tolerant Arrays", Proc. of the IEEE Int. Workshop on Defect and Faults Tolerance in VLSI Systems.

31) C. Qiao, R. Melhem, "Time-division Optical Communications in Multiprocessor Arrays", Proc. of the

6.2 Students Funded During Current Period

• Zicheng Guo (Ph.D.) Thesis Title: Array Processors with Pipelined Busses and Their Implication
in Optically and Electronically Interconnected Multiprocessor Architectures. Expected
completion date: April, 91. Supported: Summer of 1989, and all of 1990.

• Chunming Qiao (Ph.D.) Expected completion date: April 1993. Supported: September 1990
- present.

• David George (M.S.) April 1991. Topic: Synthesis of Asynchronous Finite State Machines.
Supported: Summer of 1990.

• Tom George (M.S.) April 1991. Topic: Extended Simulation Models for VHDL. Supported:
Summer of 1990.

• James Tezza (B.S.) (M.S.) Expected completion date: August 1992. Topic: Power Distribution
in Lossless Tapped Erbium Doped Fiber Busses for Multiprocessors. Supported: Spring 1991
- present.

• Manoj Bidnurkar (M.S.) Expected completion date: August 1992. Topic: Computer Model of
Erbium Doped Fiber. Supported: January 1992 - present.

7  Project  Interactions 

7.1  Conferences  and  Workshops 

• Chiarulli, Guo, and Levitan attended and presented at SPIE Boston/OE'90, November 1991,
Boston, MA.

•  Guo  and  Melhem  attended  and  presented  at  Frontiers'90  Conference,  October  1990,  College 
Park,  MD. 

• Guo and Melhem attended and presented at Intl. Conf. on Application Specific Array
Processors, September 1990, Princeton, N.J.

• Levitan and Chiarulli organized and chaired a panel discussion on "The Future of Optics in
Computing" at the November 1991 Supercomputing Conference. Qiao also presented a paper
at that conference.

• Melhem and Qiao attended and presented at the International Conference on Parallel
Processing, August, 1991.

7.2  Invited  Presentations 

• Chiarulli and Levitan gave invited talks at University of Colorado at Boulder, Optoelectronic
Computing Systems Center, Boulder, CO, February 6, 1990.

• Members of the group have been invited to participate in an AFOSR sponsored workshop on
Reconfigurable Optical Interconnects, Boulder, CO, March, 1992.

7.3 Other Interactions


• Chiarulli and Melhem are guest editors for a planned special issue of the Journal of Parallel
and Distributed Computing on Optical Computing.

• Levitan is on the Ph.D. committees of Brian Telfer, Sanjay Natarajan and John-Scott Smokelin,
students of Prof. David Casasent from Carnegie Mellon University.

• Melhem is the program chair, and Levitan is on the program committee of the Fifth ISMM
International Conference on Parallel and Distributed Computing and Systems, to be held in
Pittsburgh, October, 1992.

• Levitan gave talks about the group’s work at the IBM T.J. Watson Research Center (October
1991), and to the Department of Electrical Engineering at the University of Pittsburgh
(January 1991).

• Chiarulli ran an Optical Computing graduate seminar in the Department of Computer Science.

•  Richard  Thompson  has  agreed  to  be  on  the  Ph.D.  committee  of  Chunming  Qiao.  Professor 
Thompson  also  gave  a  talk  at  the  Department  of  Electrical  Engineering. 

• We have been interacting with Dr. William Miniscalco from GTE Labs on our work with
Erbium doped glass fiber. GTE has given us samples of doped fiber for our experiments.

8  Project  New  Discoveries 

• We have quantified the advantages and general applicability of signal pipelining for
interconnection networks.

• We have resolved (from both a theoretical and a practical point of view) the issue of shadows
in multi-dimensional structures.

• We have realized the generalization of signal pipelines to TDM structures and further to SDM
and WDM based networks as well.


9  Project  Evaluation 


We are pleased with the accomplishments of the group to date. We have made significant progress
in fabricating prototype structures to verify the applicability of our ideas to
multiprocessor interconnection networks. Our latest work in the laboratory on lossless tapped
structures is promising.

We  believe  that  our  generalization  of  coincident  structures  to  pipelined  structures,  and  pipelined 
structures  to  more  general  reconfigurable  structures  will  be  a  significant  contribution  to  the  theory 
and  practice  of  high  speed  multiprocessor  interconnection  networks. 


A  Copies  of  Recent  Papers  from  the  Research  Group 


1. An All Optical Addressing Circuit: Experimental Results and Scalability Analysis, IEEE Journal
of Lightwave Technology, 9:12, 1991

2. Optical Multicasting in Linear Arrays, International Journal on Optical Computing, in press

3. Pipelined Communications in Optically Interconnected Arrays, Journal of Parallel and
Distributed Computing, 12:3, 1991

4. Time-Division Optical Communications in Multiprocessor Arrays, Supercomputing '91,
Proceedings


An All Optical Addressing Circuit: Experimental
Results and Scalability Analysis

Donald M. Chiarulli, Member, IEEE, Robert M. Ditmore, Steven P. Levitan, Member, IEEE, and
Rami G. Melhem, Member, IEEE


Abstract— In this paper, we present results from a demonstration of both single and parallel
selection in a one-of-four optical addressing circuit operating at 250 MHz using coincident
pulse addressing. We then present an analysis of power distribution in two different tapped
fiber structures. Based on our results, we discuss issues of scalability with respect to
synchronization and power distribution in larger systems.


I.  Introduction 

Two properties of optical signals, unidirectional propagation and predictable path delay, make
it possible to devise logic systems in which information is encoded as the relative timing of
two optical signals. Coincident pulse addressing is an example of such a system. In this
technique, the address of a detector site is encoded as the delay between two optical pulses
which traverse independent optical paths to the detector. The delay is encoded to correspond
exactly to the difference between the two optical path lengths. Thus, pulse coincidence, a
single pulse with power equal to the sum of the two addressing pulses, is seen at the selected
detector site. Other detectors along the two optical paths, for which the delay did not equal
the difference in path length, detect both pulses independently, separated in time.

Stated more formally, consider an optical fiber of length $L$ with two optical pulse sources, $P_1$ and $P_2$, coupled to each end. Each source generates pulses of width $\tau$ and height $h$. Define $l = \tau c$, where $c$ is the speed of light in the fiber. In other words, $l$ is the length of fiber corresponding to the pulse width. Using 2 × 2 passive couplers, $n$ detectors, labeled $D_1$ through $D_n$, are placed in the fiber with the two tap fibers from each coupler cut to equal lengths and joined at the detector site. The location of each coupler/detector is carefully measured so that the $i$th detector is located at $(L - (n + 1)l)/2 + il$ from the left end of the bus. The optical bus in the center of Fig. 1 shows such an arrangement for $n = 3$. To uniquely address any detector, a specific delay between the pulses generated by $P_1$ and $P_2$ is chosen. If this delay is $(n - 2i + 1)\tau$, the two pulses will be coincident at detector $D_i$.

The same technique can be generalized to support parallel selections. If the $P_1$ source generates a single pulse at time $t_r$ and the source $P_2$ generates a series of pulses at times $t_j$, $j \in \{1, \ldots, n\}$, with each $t_j$ timed relative to $t_r$, then, according to the addressing equation above, to select a specific detector $i$ each $t_j$ will be in the range $-(n - 1)\tau \le t_j - t_r \le (n - 1)\tau$. Therefore, any or all of the $n$ detectors can be uniquely addressed by a positionally distinguishable pulse from source $P_2$. For convenience, this pulse train is referred to as the select pulse train and the single pulse emanating from $P_1$ is called the reference pulse. Since the length of the select pulse train is $n$, and each pulse in the return-to-zero encoding is separated by $2\tau$, it follows that the system latency is $\sigma = 2n\tau$. Further, up to $n$ locations may be selected in parallel within a single latency period. Therefore, the system throughput is $\nu = 1/(2\tau)$.
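The addressing rule above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes that $P_1$ sits at the left end of the bus, that the stated delay is the launch time of $P_1$ minus the launch time of $P_2$, and the function names and example values (a 10 m bus, 4 ns pulses, a fiber velocity of 2 × 10^8 m/s) are illustrative assumptions only.

```python
def detector_positions(L, n, l):
    """Distance of detector D_i (i = 1..n) from the left end: (L - (n+1)l)/2 + i*l."""
    return [(L - (n + 1) * l) / 2 + i * l for i in range(1, n + 1)]


def select_delay(n, i, tau):
    """Delay that makes the two pulses coincide at D_i: (n - 2i + 1) * tau."""
    return (n - 2 * i + 1) * tau


def arrival_times(L, n, tau, c, selected):
    """Arrival times (from P1, from P2) at every detector when D_selected is addressed.

    Assumption: P1 is at the left end and is launched `delay` after P2.
    """
    l = tau * c                      # fiber length spanned by one pulse width
    t1 = select_delay(n, selected, tau)
    t2 = 0.0
    return [(t1 + d / c, t2 + (L - d) / c) for d in detector_positions(L, n, l)]


def latency(n, tau):
    """System latency 2 * n * tau for an n-slot return-to-zero select train."""
    return 2 * n * tau


if __name__ == "__main__":
    # Hypothetical example values, not taken from the experiments.
    for idx, (a, b) in enumerate(arrival_times(10.0, 3, 4e-9, 2e8, selected=2), start=1):
        print(f"D{idx}: pulse separation = {abs(a - b) / 4e-9:.1f} pulse widths")
```

In this sketch the separation is zero only at the selected detector; at every other detector the two pulses arrive a multiple of $2\tau$ apart, which is why a return-to-zero select train with $2\tau$ spacing can address any subset of detectors.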

In previous papers, we have discussed the general application of coincident pulse techniques to both memory addressing and multiprocessor network applications [2], [3], [7], [8]. In this paper, we emphasize the practical limits on the applicability of this technique for large systems. In order to design large scale computer systems, we need to know the realistic limits on the speed, size, and cost of such systems. Our long term goal is to build high-speed multiprocessor interconnection networks using off-the-shelf optical components and tapped fiber busses.

Tapped fiber busses, those with one or more transmitters and multiple receivers, have been less widely adopted than simple point-to-point fibers, primarily because of scalability limits based on power distribution [9]. However, the recent development of low ratio passive couplers [5] and the prospect of fiber based optical amplifiers [4], [6] suggest a closer examination of the power distribution problem. Therefore, we have constructed a prototype system for conducting experiments from which we can extrapolate reasonable limits on the speed and size of practical multicomputer systems.

In this paper, we first present results from two laboratory experiments on a prototype coincident pulse addressing system. The question to be answered by the experiments is: how do synchronization error and power loss affect the scalability of such systems? The first experiment is therefore an examination of the coincident pulse power as a function of the synchronization of the arriving pulses. The second experiment demonstrates our ability to select arbitrary subsets of the detectors with a select pulse train operating at 250 MHz. The ability to perform selections of multiple detectors is key to various computer system applications we have investigated. However, the second experiment highlights a more fundamental problem: the power loss due to the tapping couplers on the bus diminishes the ratio of coincident to noncoincident pulse heights for long bus structures. The two experiments contrast the temporal and physical scalability of coincident pulse systems and show that the dominant effects are, in fact, power distribution limits on the physical scale of these systems.

Section III expands on the power distribution issue with an analysis of power distribution in two tapped fiber network structures. The first is the same linear structure that we use in our experiments. The second is a dual-level structure that consists of a main fiber and a series of secondary distribution fibers from which power is tapped. We conclude with a discussion of the implications of these findings for the construction of large systems.

II. Experimental Results

Fig. 1 is a diagram of the prototype structure. The fiber bus consists of a length of multimode fiber tapped three times using Gould 10-dB fiber couplers. Select and reference bit patterns are generated by modulating the 4-ns pulse output of a Tektronix PG502 pulse generator, shown in the diagram as clock, with the output of two ECL shift registers, one for select and one for reference, at gates G2 and G3. Gates G1 and G4 simultaneously hold the diode current for laser diodes P1 and P2, respectively, at threshold, while the outputs of G2 and G3 generate modulation current. The result is two 4-bit return-to-zero bit streams, which encode the information in each of the shift registers. As explained above, this allows us to select any subset of the three (and in later experiments four) detectors. The use of two shift registers allows us flexibility in the positioning of the reference pulse relative to the select pulse train.

A. Pulse Synchronization

In our first experiment, measurements were made to characterize the effect of synchronization error between the reference and select pulses on the power of the coincident pulse. Since this error can be characterized as a percentage of the pulse width, synchronization precision has a direct bearing on the absolute width and height of an addressing pulse that can be effectively detected.

A coincident pulse structure with three detectors was used, as shown in Fig. 1. This allowed detector $D_2$ to be located in the center of the bus, resulting in exactly equal noncoincident pulse heights, as shown in the oscilloscope trace of Fig. 2.

The reference and select pulse trains were configured to select $D_2$. In each step of the experiment, synchronization error was introduced by adding successively longer lengths of fiber to the ends of the bus. Length was added first on the reference pulse end of the bus, and then on the select pulse end of the bus.

Fig. 1. Synchronization experiment.

Fig. 3 shows the reduction factor, $f$, of the coincident pulse power as a function of percent synchronization error. Percent synchronization error is the error, in time units, introduced by each length of fiber divided by the pulse width. In other words, pulses at perfect coincidence (synchronization error = 0) yield a reduction factor of $f = 1.0$, which implies a coincident power equal to twice the single pulse power.

Synchronization error in either the select pulse, shown as positive error, or the reference pulse, shown as negative error, reduces this power by the factors shown. The solid line in Fig. 3 is the experimental result. The dotted line is an analytical result generated from the coincidence of two sinusoidal pulse waveforms. In both cases, the power falls off in roughly the shape of the coincident waveforms themselves.

In order to analyze this result, we must consider the sources of synchronization error. Assuming that manufacturing tolerances for electronic components and errors in fiber length measurements can be compensated for by tuning the system, the primary sources of synchronization error will be thermal variations in both the optical characteristics of the fiber and the performance of electronic components, as well as any jitter introduced by the electrical clock generators. For the former, recent results [10] have shown that the variability of the index of refraction of the fiber versus temperature is on the order of 40 ps/km-degree C, and that this is the dominant temperature effect. This represents a very minor variation in effective optical path length. Therefore, jitter and thermal effects in the electronics will be the predominant sources of synchronization error.

Fig. 3. Synchronization error reduction factor (horizontal axis: synchronization error).

However, from Fig. 3 we can see that a timing synchronization error of up to 50% only decreases the coincident pulse power to about 70% of its ideal value. Therefore, large variations (on the order of one half of a pulse width) in electronic pulse generation can be tolerated without significant degradation of the coincident signal. This result characterizes a temporal limit on scalability, based on a limit of achievable pulse widths. Timing errors of several hundred picoseconds are tolerable in gigahertz systems. Therefore, using off-the-shelf components operating in the one gigahertz range, $\tau = 1$ ns and a system throughput of $\nu = 1/(2\tau) = 500 \times 10^6$ addressing operations per second is feasible.
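The analytical curve and the throughput figure can be reproduced with a short numerical sketch. The pulse shape here is an assumption: the text says only that the analytical result comes from "two sinusoidal pulse waveforms", so the sketch models each pulse as a unit-height half-sine power waveform of width $\tau$. The function names are hypothetical.

```python
import math


def half_sine_pulse(t, tau):
    """Unit-height half-sine pulse on 0 <= t <= tau (assumed pulse shape)."""
    return math.sin(math.pi * t / tau) if 0.0 <= t <= tau else 0.0


def reduction_factor(error_fraction, tau=1.0, samples=20001):
    """Peak of two summed pulses offset by error_fraction * tau,
    relative to the ideal coincident peak of twice the single-pulse height."""
    offset = abs(error_fraction) * tau
    span = tau + offset              # window that covers both pulses
    peak = 0.0
    for k in range(samples):
        t = span * k / (samples - 1)
        total = half_sine_pulse(t, tau) + half_sine_pulse(t - offset, tau)
        peak = max(peak, total)
    return peak / 2.0


def throughput(tau):
    """Addressing operations per second, 1 / (2 * tau)."""
    return 1.0 / (2.0 * tau)


if __name__ == "__main__":
    for e in (0.0, 0.25, 0.5):
        print(f"error {e:4.0%}: reduction factor {reduction_factor(e):.3f}")
    print(f"throughput at tau = 1 ns: {throughput(1e-9):.3e} operations/s")
```

Under this half-sine assumption the peak of the overlapping pulses is $2\cos(\pi e / 2)$ for an error fraction $e$, which gives about 0.71 at $e = 0.5$, in line with the roughly 70% figure read from Fig. 3.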

The other primary limit that we need to address is optical power distribution. Since we are using a passive bus structure, the optical signals are not amplified at any point on the bus. Therefore, sufficient optical power must be available at each detector to individually discriminate coincidence from noncoincidence in the presence of selection pulses for other detectors and noise. This is the subject of the second experiment.

B.  Coincident  Pulse  Power 

Our second set of experiments was used to characterize the effect of detector position on the available coincident pulse power. A similar experimental setup was used, this time with four detectors, as shown in Fig. 4.

Figs. 5-8 show the output waveforms for detectors $D_1$ and $D_3$ for various selection patterns. Note that for each selection pattern (pair of waveforms) the experimental equipment was adjusted, so the absolute values of pulse heights vary between selection patterns. Figs. 5 and 6 show coincident and noncoincident waveforms at detectors $D_1$ and $D_3$, respectively. Note that in both cases, the noncoincident waveforms (shown in (b)) are of unequal power. This is because each pulse has passed through a different number of couplers and, hence, has been attenuated to a different level. This clearly shows that the relative power between coincident and noncoincident pulses is a function of the detector location.

Fig. 4. Detector power experiment.

Fig. 5. (a) Selection of D1 measured at D1; (b) selection of D3 measured at D1.

Figs. 7 and 8 are examples of parallel selections. The waveform in Fig. 7(a) shows a parallel selection waveform at detector site $D_3$ for the selection of three detectors, including $D_3$. This coincident waveform peak can be compared with the noncoincident waveform in Fig. 7(b), in which $D_3$ has been removed from the set of selected locations. Similarly, Fig. 8 shows parallel selection of all four detectors at sites $D_1$ and $D_3$.

To quantify the power degradation that we observed in these experiments, we define the amount of additional power in a coincident pulse relative to the largest noncoincident pulse seen by a detector as the power margin, $P_m$. This is given as a fraction of the maximum noncoincident pulse power:

$$P_m = \frac{p_1 + p_2 - \max(p_1, p_2)}{\max(p_1, p_2)} = \frac{\min(p_1, p_2)}{\max(p_1, p_2)}. \qquad (1)$$

$P_m$ indicates the threshold level needed for a detector to discriminate between coincident and noncoincident pulses. That is, for each detector on the bus the threshold should be set at

$$\frac{(P_m + 1)\,\max(p_1, p_2)}{2}.$$

$P_m$ has its maximum value, $P_m = 1$, at the center of the bus, where each pulse is at equal power, and coincidence is reflected as a doubling of power seen by the detector. It is at its minimum value at the ends of the bus. For all the selection experiments shown in Figs. 5 through 8, the power margin is in excess of 30%. That is, coincident power is greater than 130% of peak single pulse power. This is measured at $D_1$, which is the leftmost detector on the bus.

Fig. 6. (a) Selection of D3 measured at D3; (b) selection of D1 measured at D3.

In  the  next  section  we  discuss  the  implications  of  power 
margin  on  scalability  issues. 

III.  Analytical  Study  of  Power  Distribution 

In this section, we present an analysis of power distribution in each of two tapped fiber network structures. The first is a simple linear structure with a single backbone and a series of passive coupler taps, as used in the experiment above. The second is a dual-level structure, which consists of a backbone fiber and a series of secondary distribution fibers from which power is tapped.

Fig. 7. (a) Selection of D1, D2, D3 measured at D3; (b) selection of D1, D2 measured at D3.

In this analysis, we use passive, bidirectional, 2 × 2, symmetric fiber couplers as shown in Fig. 9 [1], [5]. These are identical to the couplers we used in our previously discussed experiments, except that in our analysis we assume no excess loss in the couplers. Since the couplers are bidirectional, we arbitrarily let $A$, $B$ be the input ports and $A'$, $B'$ be the output ports. Equation (2) shows power distribution from the input to the output:

$$\begin{pmatrix} A' \\ B' \end{pmatrix} = \begin{pmatrix} r & 1 - r \\ 1 - r & r \end{pmatrix} \begin{pmatrix} A \\ B \end{pmatrix}, \qquad (2)$$

where $r$ is the coupling ratio. Using these couplers, we now discuss the linear and dual-level structures.

A.  The  Linear  Structure 

As is shown in Fig. 10, a linear bus consists of $n$ detectors (and $n$ couplers). Assuming two unit-height pulses starting at opposite ends of the bus, and one type of coupler with a ratio of $r$, the optical power from each pulse, $p_i^1$ and $p_i^2$, at detector $D_i$ is given by the equations:

$$p_i^1 = r^{\,i-1}(1 - r), \qquad p_i^2 = r^{\,n-i}(1 - r). \qquad (3)$$
Fig. 8. (a) Selection of D1, D2, D3, D4 measured at detector D1; (b) selection of D1, D2, D3, D4 measured at detector D3.

Fig. 9. Symmetric fiber coupler.

Fig. 10. Linear optical bus.


Since the bus is symmetrical, we can analyze one signal that originates on the left from a single transmitter and propagates to the right as shown in Fig. 10.

Fig. 11 is a plot of $p_i^1$ versus $i$ for various values of $r$. Note that the values of $p_i^1$ are plotted on a logarithmic scale. The topmost curve is for a bus with $r = 90\%$, where the power at the first detector is 10% of the initial power. The lowest curve is for a bus with $r = 99\%$, where power at the first detector is 1% of the initial pulse power. For all the curves, the absolute power falls off geometrically with increasing $i$, $1 \le i \le n$.

Fig. 11. Power $p_i^1$ at detector $D_i$ for $90\% \le r \le 99\%$ (horizontal axis: detector index).

A bound on the number of detectors, $n$, is determined by the sensitivity of the last detector on the bus. In other words, it is the bound for a detector to discriminate between "no pulse" and "pulse." If the last detector has a sensitivity $P_{\min}$, then the maximum number of detectors supportable is

$$n = \frac{\log\left(\dfrac{P_{\min}}{1 - r}\right)}{\log(r)} + 1. \qquad (4)$$
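Equations (3) and (4) are easy to evaluate directly. The sketch below is illustrative only: the function names are assumptions, and the result of (4) is rounded down because only a whole number of detectors can be placed on the bus.

```python
import math


def linear_tap_power(i, r):
    """Power reaching detector D_i (1-based) from a unit pulse, per (3)."""
    return r ** (i - 1) * (1.0 - r)


def max_detectors_linear(p_min, r):
    """Largest whole n whose last detector still receives at least p_min, per (4)."""
    return math.floor(math.log(p_min / (1.0 - r)) / math.log(r) + 1.0)


if __name__ == "__main__":
    # 95% couplers and a sensitivity of 0.0001 of the input power
    # (for example 10 uW detected out of 100 mW injected).
    n = max_detectors_linear(1e-4, 0.95)
    print(n, linear_tap_power(n, 0.95), linear_tap_power(n + 1, 0.95))
```

For 95% couplers and $P_{\min} = 0.0001$ this gives a bound of 122 detectors, consistent with the figure of about 120 detectors read from Fig. 12.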


Equation (4) is shown graphically in Fig. 12 for a set of coupling ratios $r$ = 90%, 95%, 97%, 98%, 99%, and $0.01\% \le P_{\min} \le 1\%$ of the input power on a logarithmic scale. This graph confirms the intuition that by improving either the coupling ratio $r$ or the sensitivity of the detectors $P_{\min}$, we will be able to support more detectors on the bus. We also note the sharp drop in $n$ for high values of $P_{\min}$ and $r$, which reflects the situation where much of the available power flows off the end of the bus and is wasted.

However, for our experimental setup, it is clear that it is not the absolute power but rather the power margin that imposes a bound on the size of the system. In addition, since the bus configuration chosen for this structure requires bidirectional propagation, we are constrained to use a single tapping ratio, $r$, for all couplers. Based on these two constraints, the graph shown in Fig. 13, which is a plot of worst-case power margin $P_m$ versus $1 - r$ for various bus lengths, confirms that the power margin for the coincident structure bounds scalability more strongly than absolute power. We can see from Fig. 12 that using commercially available 95% couplers, and assuming we can tolerate a $P_{\min}$ of 0.0001 of input power, we could achieve bus lengths of about 120 detectors. This would be the case for an input of 100 mW of power injected into the bus and a detector sensitivity of 10 µW, operating at 250 MHz. However, Fig. 13 shows that for a power margin of $P_m = 20\%$ we could only reach lengths of 32 detectors. Therefore, due to both minimum power constraints and power margin issues, the system scale is highly sensitive to the fixed value of $r$. Further, we note that power margin imposes a tighter constraint than absolute power.

Fig. 12. Number of detectors versus $P_{\min}$ for various values of $r$.

Fig. 13. Power margin $P_m$ versus $0.001 \le (1 - r) \le 1$ for various bus sizes.

Fig. 14. Dual-level optical bus.
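The power margin bound can be worked out from (1) and (3). The sketch below assumes lossless couplers and takes the worst case at an end detector, where one pulse has passed through no couplers and the other through $n - 1$ of them, so that $P_m = r^{\,n-1}$. Function names are hypothetical.

```python
def power_margin(p1, p2):
    """Power margin of (1): min(p1, p2) / max(p1, p2)."""
    return min(p1, p2) / max(p1, p2)


def worst_case_margin_linear(n, r):
    """Margin at end detector D_1 of an n-detector linear bus, using (3)."""
    p_near = (1.0 - r)                    # pulse from the adjacent end
    p_far = r ** (n - 1) * (1.0 - r)      # pulse that crossed n - 1 couplers
    return power_margin(p_near, p_far)


def max_detectors_for_margin(margin, r):
    """Largest n whose worst-case margin is still at least `margin`."""
    n = 1
    while worst_case_margin_linear(n + 1, r) >= margin:
        n += 1
    return n


if __name__ == "__main__":
    print(max_detectors_for_margin(0.20, 0.95))
```

With 95% couplers and a required margin of 20%, this yields 32 detectors, matching the value read from Fig. 13 and well below the bound of roughly 120 detectors set by absolute power alone.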

To help alleviate this problem, we propose a two-level bus structure. By using two levels, we can essentially increase the tapping ratios on our buses and more effectively control the amount of power at each detector.

B.  The  Dual-Level  Structure 

The basis for the power distribution problem in the linear system is the fact that detectors at the start of the bus use more power than needed and, therefore, detectors at the end of the bus are starved. If we were to relax the requirement of fixed ratio taps in favor of varying the coupling ratios, we would need a number of distinct, precisely tuned couplers approaching the number of detector sites [9]. Yet no couplers exist that would allow tuning to a precision of more than one or two percent. Of course, the use of tuned couplers forces the network to be unidirectional, since coupling ratios must decrease in the direction of propagation.


An alternative method that does not require multiple coupling ratios is to adopt a dual-level bus structure. As shown in Fig. 14, we split the bus into a main fiber and a sublevel to create a section of the bus, labeled $m$. The sublevel contains $m$ detectors in a linear arrangement except for the last detector, which feeds back the remaining power into the main fiber and the next section. In the main fiber, care must be taken to ensure that the optical path length is the same as the subsection so that the two parts of the signal arrive synchronized at the next section. The dual-level bus consists of a series of these sections.

Once again, we start with an analysis of absolute power for this structure, and then proceed to power margin issues. Thus, we assume the input is from the left (into the upper leg to the first coupler) and propagates to the right. The detectors are numbered linearly in the direction of propagation.

We further assume two types of couplers with splitting ratios of $r$ and $s$ for the main level and sublevel, respectively. The power at any given detector site in Fig. 14 is given by

$$p_i = \begin{pmatrix} 1 - r & r \end{pmatrix} A^k u_0 \, s^l (1 - s), \qquad (5)$$

where $p_i$ is the power at site $i$, $r$ and $s$ are coupling ratios, $k$ is $i \,\mathrm{div}\, m$, $l$ is $i \bmod m$, $m$ is the number of detectors in a sublevel, and $A$ and $u_0$ are the matrix and initial vector defined in (6). [Equation (6) is not legible in the source.]

From linear algebra [11], we know that a vector of the form $u_k = A^k u_0$ can be rewritten as $u_k = \sum_i c_i \lambda_i^k x_i$, where $\lambda_i$ are the eigenvalues of matrix $A$, the $x_i$'s are the associated eigenvectors and the coefficients $c_i$ are determined from the initial condition $u_0$.

For our analysis, we rewrite the matrix of (5) in the form

$$A^k u_0 = c_1 \lambda_1^k x_1 + c_2 \lambda_2^k x_2, \qquad (7)$$

and the coefficients are determined by $c_1 x_1 + c_2 x_2 = u_0$. [The explicit eigenvectors $x_i$, the resulting coefficient equation, and its solution are not legible in the source.]

Assuming, without loss of generality, that $\lambda_1 > \lambda_2$, then as $k$ increases the $c_1 \lambda_1^k x_1$ term in (7) quickly dominates. Therefore, a good approximation is given by

$$p_i = \begin{pmatrix} 1 - r & r \end{pmatrix} c_1 \lambda_1^k x_1 \, s^l (1 - s). \qquad (8)$$

Fig. 15 shows a comparison of a linear bus and a dual-level structure for the particular case of $r = s = 90\%$, $n = 256$, and $m = \sqrt{n}$. Clearly, the power at the detectors for the linear bus falls off much more rapidly than for the dual-level bus. The dual-level bus shows a characteristic "saw-tooth" pattern of power distribution. At the beginning of each section, power is restored by injection of power from the main backbone. This more evenly distributes all of the available power down the length of the bus.
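The saw-tooth behavior can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' model: the section transfer matrix is not legible in this copy, so the code assumes that each section's main coupler mixes the main-fiber power and the power returned from the previous sublevel according to the symmetric lossless coupler with ratio $r$, that each sublevel tap removes a fraction $1 - s$, and that whatever is left after $m$ taps returns to the main coupler of the next section. Detectors are indexed from 0 so that $k = i \,\mathrm{div}\, m$ and $l = i \bmod m$. All function names are hypothetical.

```python
def linear_bus_powers(n, r):
    """Per-detector power on a linear bus with n taps of through-ratio r."""
    return [r ** i * (1.0 - r) for i in range(n)]


def dual_level_bus_powers(n, m, r, s):
    """Per-detector power on an ASSUMED dual-level bus model.

    main     : power on the main fiber entering a section coupler
    returned : power fed back from the previous sublevel
    """
    main, returned = 1.0, 0.0
    powers = []
    for start in range(0, n, m):
        # Symmetric lossless 2 x 2 coupler with ratio r (assumed).
        sub = (1.0 - r) * main + r * returned
        main = r * main + (1.0 - r) * returned
        # Linear run of up to m taps on the sublevel, each taking (1 - s).
        for _ in range(min(m, n - start)):
            powers.append(sub * (1.0 - s))
            sub *= s
        returned = sub
    return powers


if __name__ == "__main__":
    n, m, r, s = 256, 16, 0.9, 0.9
    linear = linear_bus_powers(n, r)
    dual = dual_level_bus_powers(n, m, r, s)
    print(f"weakest linear detector:     {min(linear):.3e}")
    print(f"weakest dual-level detector: {min(dual):.3e}")
    print(f"end of section 0: {dual[m - 1]:.4f}, start of section 1: {dual[m]:.4f}")
```

Under these assumptions the weakest dual-level detector receives many orders of magnitude more power than the last detector of the linear bus, and the jump in power at each section boundary reproduces the saw-tooth pattern described above.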

In the linear structure, we examined the bounds for the minimum power needed at the last detector. For the dual-level structure, we will examine the minimum power seen at the last detector of the last section. This minimum power is given by the equation

$$P_{\min} = \lambda_1^k \begin{pmatrix} 1 - r & r \end{pmatrix} c_1 x_1 \, s^{m-1} (1 - s). \qquad (9)$$

As with the linear case, the ability to support large systems is dependent upon maximizing the values of $r$ and $s$. However, in the dual-level case, we additionally may vary $m$, the number of detectors per section. The relationship between $r$, $s$, and $m$ is captured in $\lambda_1$, which is a monotonically increasing function of $r$ and $s$ but is not monotonic in $m$. Therefore, it is desirable to fix $r$ and $s$ to be as large as possible and adjust $m$ to maximize the total number of detectors in the system.

This relationship is shown in Fig. 16. The two families of curves represent coupling ratios of $r = s = 90\%$ and $r = s = 95\%$. The curves are the number of detectors (length of the bus) supportable at different $P_{\min}$ values. For the 90% curves, $P_{\min}$ = 0.0001, 0.0002, 0.0004, 0.0008, 0.0016, 0.0032, and 0.0064. For the 95% curves, $P_{\min}$ = 0.0001, 0.0002, 0.0004, 0.0008, and 0.0016. Note that the dual-level structure with 95% couplers cannot support high minimum power detectors, since the power into the first detector is $p_1 = 0.05 \times 0.05 = 0.0025$. The long tails on the curves reflect the condition where $m \ge n$.

Having chosen values for $r$, $s$, and $m$, we can rewrite equation (9) to compute the number of detectors supportable as a function of $P_{\min}$:

$$n = m \, \frac{\log(\,\cdots\,)}{\log(\lambda_1)} \qquad (10)$$

[The argument of the logarithm in the numerator of (10) is not legible in the source.]

A plot of numerical solutions for (10) is shown in Fig. 17.


Equation (10) and Fig. 17 allow a direct comparison of the dual-level bus performance with the linear bus performance derived in (4) and plotted in Fig. 12. From this comparison, we can see that, in terms of $P_{\min}$, the optimized dual-level bus gives approximate factors of between 4 and 10 improvement (depending on the coupler ratios) over the simple linear configuration.

To perform the analysis of power margin for the two-level structure, we compare the maximum power at any detector to the minimum power at any detector on the bus. This simplifies the calculation and gives a bound on the "envelope" of the saw-tooth power curve (as shown in Fig. 15). For these curves (shown in Fig. 18) we are again using $m = \sqrt{n}$ and $r = s$. Unlike the curves for the linear bus, these curves have a peak and approach an asymptotic value for very large values of $r$ and $s$. This is because, similar to the linear case, we reach a point at which a significant percentage of the power must be thrown away at the end of the bus, in order to account for the large coupling ratio of the final tap. However, it is still the case that $P_m$, the power margin, limits the scalability of the system more tightly than absolute power. As a practical example similar to the linear case, using available 95% couplers, the power margin limits bus size to about 300 detectors, rather than the 1250 detectors we could expect based on minimum power requirements of $P_{\min} = 0.0001$.

Fig. 17. $P_{\min}$ versus $n$ for different values of $r$, $s$ (horizontal axis: minimum detectable power, $P_{\min}$).

IV.  Summary 

Three factors determine system scale: threshold power margin, synchronization error, and coupling ratio. Our experiments have shown that the important system issues of latency and throughput, which are related to pulse width limits, are highly scalable. Based on current and near term technology, we have shown that synchronization error does not contribute significantly to the bounds calculated above.

On the other hand, physical scalability issues such as the size of the bus and the number of detectors that can be supported are more severely restricted due to power distribution in a system built from passive couplers. However, we believe near term technologies (e.g., fiber amplifiers) and alternate bus structures will alleviate this problem. The fact that the temporal scalability bounds show that significantly shorter pulses can be supported is very encouraging for the long-term application of this technique.
SUMMARY

In this paper, we use coincident pulse techniques to implement multicasting among processors connected by optical buses. First, we discuss two basic models of a unary addressing implementation. To reduce addressing latency and overcome system size limits, we propose a two-level addressing implementation in which multicasting introduces the problem of possibly addressing unintended processors (called shadows). We show how additional addressing pulses can be used to reduce these shadows. For regular multicasting patterns such as those often found in image processing and scientific applications, a shadow-free partition of the group to be multicasted can be systematically constructed. For arbitrary multicasting patterns, a simple, incremental partitioning algorithm is introduced. In summary, the two-level addressing implementation results in higher efficiency, lower minimum optical path requirements and potentially large speed-ups over the unary addressing.

KEYWORDS: Multicasting; Coincident pulse addressing; Optical waveguides; Shadow-free partition

1. INTRODUCTION

Coincident pulse techniques are based on two properties of optical pulse transmission, namely unidirectional propagation and predictable propagation delay per unit length. The technique was first introduced in the context of parallel memory addressing but was also applied to multiprocessor interconnection structures. In this paper, coincident pulse techniques will be applied as an addressing mechanism for multicasting among optical bus connected processors.

In Section 2, we first review coincident pulse techniques as addressing mechanisms in two models of optical bus connected multiprocessor systems. We then show how multicasting can be implemented using unary addressing. In Section 3, two-level addressing is proposed to reduce the addressing latency and overcome the system size limit imposed by unary addressing. However, multicasting with two-level addressing introduces the problem of possibly addressing unintended processors (called shadows). Simulation results of shadow reduction using additional addressing pulses are given. In Section 4, we show how shadows can be avoided by partitioning the group to be multicasted into shadow-free (SF) subgroups. We also show that speed-ups over unary addressing can be achieved when multicasting using two-level addressing. Finally, we conclude the paper in Section 5.


† A preliminary short version of this paper appears in the proceedings of the 1991 International Conference on Parallel Processing.

Figure 1. Two basic models: (a) folded bus; (b) dual bus


2. COINCIDENT PULSE ADDRESSING

As an introduction to using coincident pulse techniques as an addressing mechanism, we discuss two models of multiprocessor systems in which processors are connected by optical buses. In the first model, called the folded bus model (see Figure 1(a)), each processor transmits on the lower half-segment of a bus, while receiving from the upper half-segment. In the second model, called the dual bus model (see Figure 1(b)), each processor is connected to two buses, one for downstream transmitting and upstream receiving and the other for upstream transmitting and downstream receiving.

An optical bus consists of three waveguides, one for carrying messages, one for carrying reference pulses and one for carrying select pulses, which we call the message waveguide, the reference waveguide and the select waveguide respectively. Messages are organized as message frames, which have a certain fixed length. The propagation delay on the reference waveguide is the same as that on the message waveguide but not the same as that on the select waveguide. A fixed amount of additional delay, which we show as loops (see Figure 2), is inserted onto the reference waveguide and the message waveguide.

The basic idea of using coincident pulse techniques as an addressing mechanism is as follows. Addressing of a destination processor is done by the source processor, which sends a reference pulse and a select pulse with appropriate delays, so that after these two pulses propagate through their corresponding waveguides, a coincidence of the two occurs at the desired destination. The source processor also sends a message frame which propagates synchronously with the reference pulse. Whenever a processor detects a coincidence of a reference pulse and a select pulse, it reads the message frame. In essence, the address of a destination processor is unary encoded by the source processor using the relative transmission time of a reference pulse and a select pulse.

More specifically, let $w$ be the pulse duration in seconds, and let $c_b$ be the velocity of light in these waveguides. Define a unit delay to be the spatial length of a single optical pulse, that is $w \times c_b$. Starting with the fact that all three waveguides have equal intrinsic propagation delays, we add one unit delay (shown as one loop) on the reference waveguide and the message waveguide between any two adjacent receivers. In the folded bus model, this means that one unit delay is added between any two processors on the upper half-(receiving)-segment of the reference waveguide and the message waveguide as shown in Figure 2(a). Since there are no changes on the lower half-(transmitting)-segments of any waveguide and the message waveguide has exactly the same length as the reference waveguide, Figure 2(a) shows only the upper receiving segments of the select waveguide and the reference waveguide for a 16-processor system with a unary addressing implementation. Let $T_{ref}$ be the time when processor $i$ transmits its reference pulse and $T_{sel}(j)$ be the time when it transmits a select pulse. With delays added on the reference waveguide as in Figure 2(a), these two pulses will coincide at processor $j$ if and only if

$$T_{sel}(j) = T_{ref} + j \tag{1}$$

where $0 \le i, j < N$ and $N$ is the total number of processors in the system.

This means that for a given reference pulse transmitted at time $t$, the presence of a select pulse at time $t + j$ will address processor $j$ while the absence of a select pulse at that time will not. Since we have $0 \le (T_{sel}(j) - T_{ref}) < N$, it is clear that $N$ time units are needed to encode the complete address information of $N$ processors with a unary addressing implementation. We define an address frame to be the address information in the form of a sequence of either the presence or the absence of select pulses relative to a given reference pulse. With a unary addressing implementation, an address frame is $N$ units long.
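The unary scheme can be illustrated with a small simulation. The sketch below is a simplified model rather than the authors' implementation: time is counted in unit delays, the intrinsic propagation delay common to all waveguides is ignored because it cancels, and the function names `address_frame` and `coincidences` are hypothetical.

```python
def address_frame(destinations, n_procs):
    """Unary address frame: slot j holds a select pulse iff processor j is addressed.

    Slot j is transmitted j time units after the reference pulse.
    """
    frame = [False] * n_procs
    for j in destinations:
        if not 0 <= j < n_procs:
            raise ValueError(f"processor {j} out of range")
        frame[j] = True
    return frame


def coincidences(frame, t_ref=0):
    """Processors at which some select pulse meets the reference pulse.

    Assumed model: the reference pulse passes p added unit delays before
    reaching processor p, while select pulses pass none.
    """
    hits = []
    for p in range(len(frame)):
        ref_arrival = t_ref + p                  # reference delayed by p units
        for slot, present in enumerate(frame):
            sel_arrival = t_ref + slot           # select pulse sent at t_ref + slot
            if present and sel_arrival == ref_arrival:
                hits.append(p)
    return hits


frame = address_frame({3, 7, 12}, n_procs=16)
print(len(frame))            # 16: the frame is N units long
print(coincidences(frame))   # [3, 7, 12]
```

In this model a multicast costs no more than a single-destination message, since the frame length is fixed at N regardless of how many select pulses are present.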

Figure 2(b) shows the position of the reference pulse and the select pulse addressing processor $j$ at the transmission time of an address frame in the folded bus model. The relative position of the select pulse to the reference pulse will remain the same from the time the address frame is transmitted to the time it finishes propagation through the transmitting segment of the bus. However, the relative position will be changed as the address frame propagates through the receiving segment of the bus.

From the value of the term $T_{sel}(j) - T_{ref}$, it is also clear that the relative positions of the reference pulse and select pulses are independent of the sending processor at the transmission time in this model. However, in the dual bus model, where a sending processor could transmit downstream and upstream using two buses, the necessary and sufficient condition for a select pulse to coincide with the reference pulse is

$$T_{sel}(j) - T_{ref} = j - i, \quad \text{if } i \le j \tag{2a}$$

or

$$T_{sel}(j) - T_{ref} = i - j, \quad \text{if } j < i \tag{2b}$$

Noting that these two models have equivalent functionalities and similar operations, we will concentrate our discussions on the first model, namely the folded bus model, throughout the rest of this paper.

One advantage of using coincident pulse techniques as an addressing mechanism is its applicability to multicasting. Traditional addressing mechanisms for multicasting, such as separate addressing, multi-destination addressing and source routing, have been mainly developed for point-to-point networks and are inefficient, especially in bus connected systems. Most recent research work exploits tree forwarding on broadcast networks, which constructs multicast trees and uses group identifiers when multicasting. It requires explicit group formations of communicating processors.

Using  coincident  pulse  addressing  however,  the  sender  can  multicast  to  an  arbitrary  group 
of  processors  by  sending  an  address  frame  containing  one  or  more  select  pulses  which  are 
properly  positioned  so  that  each  of  them  coincides  with  the  reference  pulse  at  one  of  the 
multicasting  destinations.  Once  a  processor  detects  a  coincidence,  it  picks  up  a  copy  of  the 
multicasted  message  frame  which  is  synchronous  with  the  reference  pulse. 


3. TWO-LEVEL ADDRESSING

Using  unary  addressing,  an  address  frame  is  N  units  long.  There  are  two  reasons  why  we 
want  to  reduce  the  length  of  address  frames  by  using  two-level  addressing.  One  has  to  do 
with  efficiency.  Unary  addressing  could  be  very  inefficient  in  a  large  multiprocessing  system 
where  the  address  frame  is  longer  than  the  message  frame.  The  other  reason  has  to  do  with 
the  physical  limitation  of  optical  path  length  between  two  adjacent  processors.  One  way  to 
ensure  that  the  frames  sent  by  one  processor  do  not  collide  with  other  frames  sent  by  other 
processors is to arbitrate the bus to allow exclusive access by one processor at a time, as in References 4 and 11. Another way is to pipeline the bus, that is, to synchronize all processors such that they send messages at the beginning of each cycle. The propagation delays between two adjacent processors should be large enough to prevent frames from overlapping, as in References 8 and 11. If unary addressing is used, it is necessary for the optical path between any two adjacent processors to have a length of at least $N \times w \times c_b$ (that is, $N$ unit delays) to prevent overlapping of the address frames. Although the required minimum optical path
length  can  be  reduced  by  shortening  the  pulse  width  w,  the  address  frame  length,  which  is 
linear  in  the  system  size,  becomes  a  limiting  factor. 

A  two-level  addressing  implementation  divides  the  whole  system  into  logical  clusters. 
Addressing  of  a  single  destination  is  accomplished  by  using  one  level  of  unary  addressing 
to  select  a  particular  cluster  and  another  level  of  unary  addressing  to  select  an  individual 
processor  within  the  selected  cluster.  Two  trains  of  select  pulses  are  used,  one  for  each  level 
of  addressing  and  their  pulse  trains  are  sent  in  parallel.  Therefore,  the  length  of  address 
frames  can  be  reduced  as  neither  the  number  of  clusters  nor  the  size  of  any  cluster  is  larger 
than  the  system  size. 

Assume that $N = n^2$ processors are linearly connected. If every $n$ consecutive processors constitute one logical cluster, two-level addressing in this linear system is logically equivalent to addressing a two-dimensional array. More specifically, we can view the linear system as the result of embedding an $n \times n$ array in row major fashion. Each row of processors of the array is embedded into $n$ consecutive processors in the linear system. Hence, selecting a logical cluster is equivalent to selecting a row while selecting an individual processor within a cluster is equivalent to selecting a column processor within a row.


3.1. Two-level addressing in a linear system

As mentioned above, we will view a linear system with $N = n \times n$ processors as a result of embedding an $n \times n$ array in row major fashion and use terms such as 'row', 'column' and 'diagonal' logically. As a logical equivalent to two-level addressing, a two-dimensional addressing implementation uses two select waveguides: one to select a row, and another to select a column. A pulse sent on the row select waveguide will coincide with the reference pulse at all the processors in a particular row, while a pulse sent on the column select waveguide will coincide with the reference pulse at all processors in a column. In other words, a row select pulse causes a row select trace and a column select pulse causes a column select trace. Pulses sent on these two select waveguides are denoted by W1 and W2 respectively.

A coincidence is said to occur at a given processor only if all three pulses, namely a reference pulse, a W1 pulse and a W2 pulse, coincide with each other at that processor. Since unary addressing is used when selecting a row, each pulse in the pulse train of W1 corresponds to one row. Similarly, each pulse in the pulse train of W2 corresponds to one column. Therefore, sending a reference pulse and a pair of one W1 pulse and one W2 pulse causes a coincidence at a processor that is located at the intersection of the corresponding row and column. More specifically, we denote $L^1_i$ to be the pulse of W1 selecting row $i$ and denote $L^2_j$ to be the pulse of W2 selecting column $j$. By sending these two pulses and the reference pulse, the processor at row $i$ and column $j$, which is processor $i \times n + j$ in an $N = n^2$ linear structure, is addressed. Figure 3 shows a logical two-dimensional view of addressing processor 10 with these two select pulses when $i = j = 2$
and $N = 16$.

In order to achieve the above coincidence pattern, we add, as in unary addressing, one unit delay between two processors on the receiving segments of the reference waveguide and the message waveguide. In addition, on the receiving segment of the row select waveguide (W1 waveguide), one unit delay is added between successive processors and an extra unit delay is added between two processors of successive rows. On the receiving segment of the column select waveguide (W2 waveguide), $n$ unit delays are added between two receivers of successive rows. The amount of delay added on these two select waveguides can be obtained by solving a set of underconstrained equations (see Appendix). Again, because the message waveguide has exactly the same length as the reference waveguide and the lower half-(transmitting)-segments of all waveguides do not have any delays, only the receiving segments of the reference waveguide and the two select waveguides are shown in Figure 4(a). Taps from the waveguides to the processors are also omitted from the figure.

Figure 4. A two-level addressing implementation: (a) the receiving segments of the reference, the row select and the column select waveguides; (b) relative pulse positions of W1 and W2 in an address frame

Let $r_i$ and $c_i$ be the row number and column number of processor $i$ respectively. That is, $i = r_i \times n + c_i$ where $0 \le i < N$ and $0 \le r_i, c_i < n$. Let $T_{ref}$ be the time when a processor transmits its reference pulse, and further assume a processor transmits a W1 pulse selecting row $r_i$ at time $T_{sel}(L^1_{r_i})$ and a W2 pulse selecting column $c_i$ at time $T_{sel}(L^2_{c_i})$. Given the added delays on three waveguides as shown in Figure 4(a), a coincidence of these three pulses will occur at processor $i$ if and only if

$$T_{sel}(L^1_{r_i}) + (i + r_i) = T_{ref} + i \tag{3a}$$

and

$$T_{sel}(L^2_{c_i}) + n \times r_i = T_{ref} + i \tag{3b}$$

That is,

$$T_{sel}(L^1_{r_i}) = T_{ref} - r_i \tag{4a}$$

and

$$T_{sel}(L^2_{c_i}) = T_{ref} + c_i \tag{4b}$$

Since $0 \le r_i < n$, a W1 pulse is ahead of a reference pulse by 0 up to $n - 1$ units. The presence of a W1 pulse $r_i$ units ahead of a reference pulse selects processors at row $r_i$ while the absence does not. Similarly, a W2 pulse is 0 up to $n - 1$ units beyond a reference pulse and the presence of a W2 pulse $c_i$ units beyond the reference pulse selects processors at column $c_i$ at each row while the absence does not. An address frame in the two-level addressing implementation contains a train of W1 pulses and a train of W2 pulses and has a length of $2 \times n - 1$ units. Figure 4(b) shows relative positions of the reference pulse and two trains of select pulses in an address frame at the time of transmission. Again, owing to the fact that an address frame will remain the same as it propagates through the transmitting segment of the bus in the folded bus model, the relative positionings of the reference pulse and both W1 and W2 pulses in an address frame at the time of transmission are independent of the sending processor.
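The two-level timing can be checked numerically. The following sketch assumes the delay layout described above (one unit delay per processor on the reference waveguide, one per processor plus one extra per row on the W1 waveguide, and n per row on the W2 waveguide) and counts time in unit delays; `pulse_times` and `coincides_at` are hypothetical names introduced only for illustration.

```python
def pulse_times(dest, n, t_ref=0):
    """Transmission times of the W1 and W2 pulses that address processor `dest`."""
    r, c = divmod(dest, n)           # row-major embedding: dest = r * n + c
    return t_ref - r, t_ref + c      # W1 is sent r units early, W2 c units late


def coincides_at(p, t_w1, t_w2, n, t_ref=0):
    """True if the reference, W1 and W2 pulses all arrive at processor p together."""
    r = p // n
    ref_arrival = t_ref + p          # one unit delay per processor
    w1_arrival = t_w1 + p + r        # one per processor plus one extra per row
    w2_arrival = t_w2 + n * r        # n unit delays per row
    return ref_arrival == w1_arrival == w2_arrival


n = 4
t_w1, t_w2 = pulse_times(10, n)
print(t_w1, t_w2)                                                   # -2 2
print([p for p in range(n * n) if coincides_at(p, t_w1, t_w2, n)])  # [10]
print(2 * n - 1)                                                    # 7 time slots per frame
```

The pulse offsets range over -(n - 1) to n - 1 relative to the reference pulse, which is where the frame length of 2n - 1 units comes from.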

Multicasting to a group of processors can be accomplished by sending a reference pulse, one train of W1 pulses with one or more pulses present and one train of W2 pulses with one or more pulses present along with a message frame. However, by doing so, coincidences may also occur at unintended processors, which we call shadows. For example, when both processor $i$ and $j$ are addressed, the W1 train consists of two pulses, one for row $r_i$ and another for row $r_j$. Similarly, the W2 train also consists of two pulses, one for column $c_i$ and another for column $c_j$. In addition to processor $i$ and $j$, the processor at row $r_i$ and column $c_j$ also detects a coincidence and picks up a copy of the multicasted message. So does the processor at row $r_j$ and column $c_i$. Figure 5 shows a logical two-dimensional view of shadows at processor 1 and 10 as a result of multicasting to processors 2 and 9 in a 16 processor system.
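The shadow set follows directly from the cross product of the selected rows and columns. A minimal sketch, assuming the row-major embedding used above (the function name `shadows` is hypothetical):

```python
def shadows(group, n):
    """Processors that detect a coincidence without belonging to `group` (W1 and W2 only)."""
    rows = {p // n for p in group}       # rows selected by the W1 train
    cols = {p % n for p in group}        # columns selected by the W2 train
    addressed = {r * n + c for r in rows for c in cols}
    return sorted(addressed - set(group))


print(shadows({2, 9}, n=4))         # [1, 10]
print(shadows({4, 5, 6, 7}, n=4))   # []  (a full row creates no shadows)
```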


3.2. Shadow reduction

As can be seen from the above example, shadows are created because of the unintended couplings of a W1 pulse with a W2 pulse. One way to reduce shadows is to further identify the intended pairs by using additional select waveguides, called check waveguides, for carrying select pulses called check pulses. Check pulses are arranged such that they do not coincide with the reference pulse at places where shadows were created. Only processors at which coincidences of a reference pulse and all select pulses occur are addressed. This technique for shadow reduction was introduced in Reference 9 for addressing a two-dimensional memory structure. In the remainder of this section, we will show how to apply this technique to two-level addressing in a linear system. Note that having an additional check waveguide in two-level addressing is different from having three-level addressing. The latter would be logically equivalent to addressing a three-dimensional array. That is, addressing a single destination would require three select pulses. The address frame length would be further reduced while more shadows would be likely when multicasting with three-level addressing.


[Figure 5; legend: shadow processor]

Figure 6. Adding two check pulses W3 and W4: (a) a logical view of traces of W3 pulses; (b) a logical view of traces of W4 pulses; (c) the receiving segments of the W3 and the W4 waveguides; (d) relative pulse positions of W3 and W4 in an address frame


One such set of check pulses are 45° diagonal select pulses, which will be called W3 hereafter. Another are −45° diagonal select pulses, which will be called W4. Each pulse in a W3 or W4 train coincides with the reference pulse at all the processors that are on a 45° or −45° diagonal line respectively, and therefore selects all processors on that diagonal line. Let $L^3_{\pm k}$ be the W3 pulse selecting a 45° diagonal line which is $k$ lines below or above the main 45° diagonal line respectively. Figure 6(a) shows a logical two-dimensional view of traces of W3 pulses. Similarly, let $L^4_{\pm k}$ be the W4 pulse selecting a −45° diagonal line which is $k$ lines above or below the main −45° diagonal line respectively. Figure 6(b) shows a logical two-dimensional view of traces of W4 pulses.

Again, the amount of delay that should be added on the W3 and W4 waveguides can be obtained by solving a set of underconstrained equations. Figure 6(c) shows only the receiving segments of both W3 and W4 waveguides with added delays. As a result, a W3 pulse and a W4 pulse will coincide with the reference pulse at the desired corresponding locations if and only if

$$T_{sel}(L^3_k) = T_{ref} + k \tag{5a}$$

and

$$T_{sel}(L^4_k) = T_{ref} + k \tag{5b}$$

These two equations can be derived as follows. First, any processor at a 45° diagonal line which is $k$ lines below the main diagonal line has the index number of $k \times n + i \times (n - 1)$, for $1 \le i \le (n - k)$. This means that the reference pulse will go through $k \times n + i \times (n - 1)$ unit delays to arrive at the processor. Given that there are $n - 1$ added unit delays at the beginning of each row on the receiving segment of the W3 waveguide as in Figure 6(c) (with $n = 4$), the W3 pulse $L^3_k$ will coincide with the reference pulse at the processor if and only if $T_{sel}(L^3_k) + (k + i) \times (n - 1) = T_{ref} + k \times n + i \times (n - 1)$, that is, $T_{sel}(L^3_k) = T_{ref} + k$. Similarly, we can completely derive the above equations.

In addition to a W1 train and a W2 train, an address frame now also contains a W3 train as well as a W4 train. Adding a W3 train and a W4 train does not change the length of an address frame, nor does it change the content of the W1 train or the W2 train. Figure 6(d) shows the relative positions of two trains of W3 and W4 pulses at the time of transmission of an address frame.

As an example, the two shadows in Figure 5 can be eliminated by using two W3 pulses, namely pulse $L^3_{-1}$ and pulse $L^3_0$. A more complicated example is shown in Figure 7. Figure 7(a) shows a logical view of six shadows created at processors 0, 2, 9, 10, 12 and 13 when multicasting to three processors 1, 8 and 14 without using any check pulses. Figure 7(b) shows a logical view of eliminating these shadows by adding three W3 pulses and three W4 pulses.

Figure 7. Shadow reduction using check pulses: (a) a logical view of six shadows without using any check pulses; (b) a logical view of eliminating shadows in (a) with three W3 pulses and three W4 pulses

[Figure 8; axis label: number of shadows]
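The effect of check pulses can be reproduced with a short sketch. It assumes that a 45° diagonal is a line of constant row + column and a −45° diagonal a line of constant row − column in the row-major view; this indexing, and the names `addressed` and `shadows`, are illustrative assumptions rather than the paper's notation.

```python
def addressed(group, n, use_w3=False, use_w4=False):
    """Processors at which every transmitted pulse train coincides with the reference."""
    cells = [divmod(p, n) for p in group]
    rows = {r for r, _ in cells}             # W1 pulses present
    cols = {c for _, c in cells}             # W2 pulses present
    diag_pos = {r + c for r, c in cells}     # W3 pulses present (assumed indexing)
    diag_neg = {r - c for r, c in cells}     # W4 pulses present (assumed indexing)
    hit = set()
    for p in range(n * n):
        r, c = divmod(p, n)
        ok = r in rows and c in cols
        if use_w3:
            ok = ok and (r + c) in diag_pos
        if use_w4:
            ok = ok and (r - c) in diag_neg
        if ok:
            hit.add(p)
    return hit


def shadows(group, n, **checks):
    """Addressed processors that are not members of the group."""
    return sorted(addressed(group, n, **checks) - set(group))


print(shadows({2, 9}, 4))                # [1, 10]
print(shadows({2, 9}, 4, use_w3=True))   # []
```

Because each check train only adds a further condition for a coincidence, enabling it can remove shadows but can never create one.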

Figure 8 shows simulation results on the numbers of shadows created with different numbers of check pulses used. It is clear that the addition of check pulses cannot introduce new shadows; it can only reduce the number of existing shadows. However, adding a fixed
number  of  check  pulses  cannot  always  completely  eliminate  shadows,  since  the  theorem 
given  in  Reference  9  for  a  two-dimensional  parallel  memory  structure  also  holds  here. 


4. SHADOW AVOIDANCE

Having established the relationship between a particular two-level addressing structure and its logically equivalent two-dimensional addressing representation, we will adopt the usual notion of two-dimensional addressing in an $n \times n$ array in the following discussions for the purpose of simplicity. However, it is worth noting that the techniques developed will be applied in physically linear systems with two-level addressing.

One way to avoid shadows when multicasting to a group of processors is to partition the whole group into several subgroups such that each subgroup is multicasted within one cycle without creating any shadows. More formally, assume a group of $m$ processors is a set of linearly ordered processors denoted by $G = \{P_1, P_2, \ldots, P_m\}$. That is, for all $1 < i \le m$, $0 \le P_i < N$ and $P_{i-1} < P_i$. Define a shadow free (SF) partition of the set $G$ to be a number of subgroups $S_1$ through $S_g$ such that for all $1 \le i, j \le g$ the following conditions are satisfied: (a) $S_i \cap S_j = \emptyset$ if $i \ne j$, (b) $\bigcup_i S_i = G$ and (c) each $S_i$ can be multicasted in one cycle without any shadows.

A  number  of  subgroups  is  called  a  maximal  SF  partition  if  it  is  a  SF  partition  and  if 
multicasting  to  more  than  one  of  the  subgroups  within  one  cycle  will  create  a  shadow. 
Therefore  the  number  of  subgroups  of  a  maximal  partition  is  the  number  of  cycles  needed 
to  complete  the  multicasting  to  the  whole  group  G. 
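These definitions translate into simple checks. The sketch below assumes only the W1 and W2 select pulses are used, so a subgroup is shadow free exactly when it fills the cross product of its rows and columns; the helper names are hypothetical, and the maximality test enumerates all unions of two or more subgroups, which is practical only for small partitions.

```python
from itertools import combinations


def is_shadow_free(subgroup, n):
    """True if multicasting `subgroup` in one cycle addresses no other processor."""
    rows = {p // n for p in subgroup}
    cols = {p % n for p in subgroup}
    return len(rows) * len(cols) == len(set(subgroup))


def is_sf_partition(subgroups, group, n):
    """Check conditions (a), (b) and (c) of a shadow free partition."""
    members = [p for s in subgroups for p in s]
    disjoint = len(members) == len(set(members))                  # (a)
    covering = set(members) == set(group)                         # (b)
    shadow_free = all(is_shadow_free(s, n) for s in subgroups)    # (c)
    return disjoint and covering and shadow_free


def is_maximal(subgroups, n):
    """True if merging any two or more subgroups would create a shadow."""
    for size in range(2, len(subgroups) + 1):
        for combo in combinations(subgroups, size):
            if is_shadow_free(set().union(*combo), n):
                return False
    return True


print(is_sf_partition([{2}, {9}], {2, 9}, 4))   # True: two cycles, no shadows
print(is_sf_partition([{2, 9}], {2, 9}, 4))     # False: shadows at 1 and 10
print(is_maximal([{2}, {9}], 4))                # True
print(is_maximal([{0, 2}, {8, 10}], 4))         # False: the union is shadow free
```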

Let processor $X$, where $0 \le X < N$, be a shadow created when multicasting to a group $G$ in one cycle; clearly $X \notin G$. Define four shadow conditions ($SC_i$, $i$ = 1, 2, 3 and 4) as follows.

$SC_1$: there is at least one processor in $G$ that is in the same row as $X$

$SC_2$: there is at least one processor in $G$ that is in the same column as $X$

$SC_3$: there is at least one processor in $G$ that is in the same 45° diagonal line as $X$

$SC_4$: there is at least one processor in $G$ that is in the same −45° diagonal line as $X$.

It can be verified that each of the conditions $SC_i$ is necessary for $X$ to be a shadow if the corresponding select pulse W$i$ is used to multicast the group $G$. The logical AND of all the necessary conditions becomes the sufficient condition. For example, if two pulses, W1 and W2, are used, then both condition $SC_1$ and $SC_2$ are the necessary conditions for a shadow to occur. The logical AND of the two is the sufficient condition.


4.1. Regular multicasting patterns

In some applications, such as finite element analyses and image processing, multicasting patterns can be quite regular. For example, a convolution of an $n \times n$ array involves multicastings of an element to its $w \times w$ neighbours, where $w$ is the current window size. A group to be multicasted could also be all processors of a row, or of a column or of a diagonal line. By embedding a physical 2-D structure into our linear structure in the row-major fashion, these regular 2-D patterns can be characterized by a group of four parameters. More formally, in an embedded $n \times n$ system, we consider a group $G$ of $m$ processors starting with the processor numbered as $k$ (called offset) with increment of $d$ (called stride). Using the general notation in the beginning of the section, we have $G = \{k, k + d, \ldots, k + (m - 1) \times d\}$, where $0 \le k \le k + (m - 1) \times d < N = n^2$. We can use $G(k, d, m, n)$ to uniquely represent such a regular group. We call a group a dense group if $d$ is less than $n$, a sparse group otherwise.
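As an illustration, the four parameters can be expanded into processor numbers as follows. The function names are hypothetical, and the examples use a 4 × 4 system.

```python
def regular_group(k, d, m, n):
    """Members of the regular group with offset k, stride d and size m in an n x n system."""
    last = k + (m - 1) * d
    if not (m >= 1 and d >= 1 and 0 <= k <= last < n * n):
        raise ValueError("group does not fit in the n x n system")
    return [k + t * d for t in range(m)]


def is_dense(d, n):
    """A group is dense if its stride is smaller than the row length, sparse otherwise."""
    return d < n


print(regular_group(4, 1, 4, 4), is_dense(1, 4))   # row 1:            [4, 5, 6, 7] True
print(regular_group(2, 4, 4, 4), is_dense(4, 4))   # column 2:         [2, 6, 10, 14] False
print(regular_group(3, 3, 4, 4), is_dense(3, 4))   # a 45° diagonal:   [3, 6, 9, 12] True
print(regular_group(0, 5, 4, 4), is_dense(5, 4))   # a -45° diagonal:  [0, 5, 10, 15] False
```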

While we can make tradeoffs between the number of select waveguides used and the number of cycles needed to multicast to a group of processors, we will first analyse simple cases in which only two select waveguides are used. The results will be extended to cases in which four select waveguides are used.

Definition 1. A row of processors in a logical two-dimensional array is incomplete with regard to a group $G(k, d, m, n)$ if and only if the row contains two processors $i$ and $j$ such that $i \in G$, $j \notin G$ and $|j - i| = b \times d$ for some integer $b > 0$. A row is complete if and only if the row contains at least one processor of the group $G$ and is not an incomplete one.

Definition 2. Define $I(k, d, m, n)$ to be the number of incomplete rows with regard to the group $G$.
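The number of incomplete rows can be obtained by brute force. The sketch below reads Definition 1 as follows: a row is incomplete when it holds a group member i and a non-member j whose distance from i is a positive multiple of the stride d. That reading, and the function name `incomplete_rows`, are assumptions made for illustration.

```python
def incomplete_rows(k, d, m, n):
    """Rows that are incomplete with regard to the regular group G(k, d, m, n)."""
    group = {k + t * d for t in range(m)}
    result = []
    for r in sorted({p // n for p in group}):
        row = range(r * n, (r + 1) * n)
        members = [p for p in row if p in group]
        # j lies outside the group but on the same stride pattern as a member i
        cut_off = any(j not in group and (j - i) % d == 0
                      for i in members for j in row)
        if cut_off:
            result.append(r)
    return result


print(incomplete_rows(3, 1, 3, 4))   # [0, 1]: G = {3, 4, 5} is cut off in both rows
print(incomplete_rows(0, 3, 6, 4))   # []
print(incomplete_rows(2, 4, 4, 4))   # []: a sparse group has no incomplete rows
```

The count I(k, d, m, n) is then the length of the returned list.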

Let the first processor of the group be $k = r_k \times n + c_k$ and the last processor of the group be $l = k + (m - 1) \times d = r_l \times n + c_l$ for some integers $0 \le r_k, c_k, r_l, c_l < n$. Let condition 1 be that $c_k \ge d$, condition 2 be that $n - c_l > d$ and condition 3 be that $r_k \ne r_l$. There will be two incomplete rows, namely row $r_k$ and row $r_l$, if and only if all three conditions are true. There will be no incomplete rows if and only if neither condition 1 nor condition 2 is true. Otherwise, there will be only one incomplete row. Therefore, $I$ has an upper bound of 2. Note that for a sparse group $G(k, d, m, n)$ where $d \ge n$, neither condition 1 nor condition 2 is true; therefore $I = 0$.

Lemma 1. Two processors of a group $G(k, d, m, n)$ numbered as $i$ and $j$, $i < j$, will be in the same column if and only if $j - i = b \times LCM(n, d)$ for some integer $b > 0$.

Proof. Let $i = r_i \times n + c_i$ and $j = r_j \times n + c_j$ as before. On the one hand, if $c_i = c_j$, then $j - i = (r_j - r_i) \times n$ and $r_j > r_i$. Since both processors $i$ and $j$ are in the same group, $j - i$ must be a multiple of $d$; therefore $j - i$ should be a common multiple of both $n$ and $d$. On the other hand, if $j - i = b \times LCM(d, n)$ for some integer $b > 0$, then clearly $j - i$ is a multiple of $n$, which means processor $i$ and $j$ are at the same column. ■
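Lemma 1 is easy to confirm numerically for small systems. The following sketch (with the hypothetical helper `lemma1_holds`) compares the column test with the LCM test for every pair of members of a regular group.

```python
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def lemma1_holds(k, d, m, n):
    """Check that i, j in G(k, d, m, n) share a column iff j - i is a multiple of LCM(n, d)."""
    group = [k + t * d for t in range(m)]
    step = lcm(n, d)
    for x, i in enumerate(group):
        for j in group[x + 1:]:
            same_column = (i % n == j % n)
            if same_column != ((j - i) % step == 0):
                return False
    return True


group = [t * 3 for t in range(6)]     # G(0, 3, 6, 4) = {0, 3, 6, 9, 12, 15}
pairs = [(i, j) for i in group for j in group if i < j and i % 4 == j % 4]
print(pairs)                          # [(0, 12), (3, 15)]
print(lcm(4, 3))                      # 12
print(lemma1_holds(0, 3, 6, 4))       # True
```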

Lemma 2. If there is a processor $i$ in a group $G(k, d, m, n)$, $i = r_i \times n + c_i$, then processor $j$ at the same column, $j = r_j \times n + c_i$, is also in the group $G$ if

$$|r_j - r_i| = \frac{b \times d}{GCD(n, d)}$$

and row $r_j$ is a complete one.

Proof. (By contradiction.) According to the definition of complete row, there must be a processor $j'$ at row $r_j$ and $j' \in G$. Clearly, $|j' - i|$ should be a multiple of $d$. In addition, since processor $j$ and $i$ are at the same column and are $(b \times d)/GCD(n, d)$ rows apart, $|j - i| = |r_j - r_i| \times n$. That is,

$$|j - i| = \frac{b \times d \times n}{GCD(n, d)} = b \times LCM(n, d)$$

which is also a multiple of $d$. Therefore, $|j - j'|$ must be a multiple of $d$ also. If $j \notin G$, then row $r_j$ is not a complete one, which contradicts the condition stated above. Therefore, $j \in G$. ■

According to the above lemmas, for a given group G, if we draw one vertical line at each processor of the group G, then all processors at the intersections of these vertical lines with complete rows which are (b × d)/GCD(n, d) apart will belong to the group G, and therefore can be multicasted without any shadows using row select pulses W1 and column select pulses W2.

Theorem 1. For a dense group G(k, d, m, n) where d < n, the number of subgroups of a maximal partition with select pulses W1 and W2 has an upper bound of d/[GCD(n, d)] + I.

Proof. We will prove the theorem by constructing a SF partition of the group. First, we use one subgroup for processors of the group G at each incomplete row. According to the definition, there are I such subgroups and no shadows will occur in any of these subgroups because of the shadow condition SC1. We partition the rest of the group at the remaining complete rows as follows.

Let R = d/[GCD(n, d)]. Starting at the first complete row, we put processors of the group at every R rows apart into a subgroup, creating exactly R subgroups. It can be proved, as in Lemma 2, that no shadows will occur in any of these R subgroups. Therefore, a maximal SF partition will have at most d/[GCD(n, d)] + I subgroups. ■

It can be shown that the SF partition constructed in the above theorem is indeed a maximal SF partition. We can similarly prove the following theorem for a sparse group. First of all, I is equal to zero when d ≥ n. Secondly, between any two rows which are R rows apart, where R = d/[GCD(n, d)], there could be at most (R × n)/d processors belonging to the group and hence at most [LCM(n, d)]/d complete rows. Therefore, when putting processors of the group every R rows apart into one subgroup, there are at most [LCM(n, d)]/d subgroups in such a SF partition of the group G. This is stated in Theorem 2.

Theorem 2. For a sparse group G(k, d, m, n) where d ≥ n, the number of subgroups of a maximal partition with select pulses W1 and W2 has an upper bound of [LCM(n, d)]/d. ■
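
The bounds of Theorems 1 and 2 are easy to evaluate for given parameters. The following is a minimal sketch, assuming the number of incomplete rows I is supplied by the caller (it is not derived here); the function and parameter names are illustrative only.

```python
from math import gcd


def max_partition_bound(d: int, n: int, incomplete_rows: int = 0) -> int:
    """Upper bound on the number of subgroups with select pulses W1 and W2.

    Dense group (d < n):   d / GCD(n, d) + I      (Theorem 1)
    Sparse group (d >= n): LCM(n, d) / d          (Theorem 2)
    """
    g = gcd(n, d)
    if d < n:
        return d // g + incomplete_rows
    lcm_nd = n * d // g
    return lcm_nd // d


# A column of an 8 x 8 array is a sparse group with d = n = 8.
column_bound = max_partition_bound(8, 8)
```

For the column example the bound evaluates to 1, which agrees with the one-cycle multicast to a column mentioned later in this section.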

We can also prove the following theorems when using either one of the two diagonal select pulses, namely W3 or W4, with W1 instead of using W2 with W1 as in the above theorems.

Theorem 3. For a dense group G(k, d, m, n) where d < n, the number of subgroups of a maximal partition with select pulses W1 and either W3 or W4 has an upper bound of d/[GCD(n − 1, d)] + I or d/[GCD(n + 1, d)] + I respectively. ■

Theorem 4. For a sparse group G(k, d, m, n) where d ≥ n, the number of subgroups of a maximal partition with select pulses W1 and either W3 or W4 has an upper bound of [LCM(n − 1, d)]/d + 1 or [LCM(n + 1, d)]/d respectively. ■

The idea used to construct SF partitions in Theorems 3 and 4 is similar to the one used in Theorems 1 and 2. Processors of a group at a certain number of rows apart will be put into one subgroup. These processors are at the intersections of these rows with either 45° diagonal lines or −45° diagonal lines depending on which one of the select pulses, W3 or W4, is used.

Since the addition of select pulses will not create any new shadows, we can choose a SF partition which has the least number of subgroups when all four select pulses discussed above are used.

Theorem 5. For a dense group G(k, d, m, n) where d < n, the number of subgroups of a maximal partition with four select pulses W1, W2, W3 and W4 has an upper bound of

d / max{GCD(n − 1, d), GCD(n, d), GCD(n + 1, d)} + I. ■

Theorem 6. For a sparse group G(k, d, m, n) where d ≥ n, the number of subgroups of a maximal partition with four select pulses W1, W2, W3 and W4 has an upper bound of

min{LCM(n − 1, d)/d, LCM(n, d)/d, LCM(n + 1, d)/d}. ■

By applying Theorem 5 or Theorem 6 to some special instances of a group G(k, d, m, n), such as a group of processors at one row with d = 1, a group of processors at one column with d = n, or a group of processors at either diagonal line with d = n − 1 or d = n + 1, we know that each multicasting operation to such groups can be done in only one cycle without any shadows. These multicasting patterns are often seen in matrix manipulations, among many other applications.

The two theorems above can also be extended to a group of processors located in a small area of an n × n array. We delimit the area by an h × h array with h ≤ n. The processors of the h × h array can be renumbered in a row major fashion from 0 to h² − 1. If a group of processors can be represented as G(k, d, m, h), we can partition the group similarly to what we did before. Since no shadow is possible outside the h × h area, by replacing every occurrence of n with h, Theorem 5 and Theorem 6 can be applied to a group G(k, d, m, h) when d < h or d ≥ h respectively.

The importance of this extension is that some of the most frequently used multicasting patterns in image processing can now be analysed in terms of SF partitions. For example, a four-neighbour group around any processor can be represented by a group with k = 1, d = 2, m = 4 and h = 3, or G(1, 2, 4, 3), and each multicasting operation to the group can be done in one cycle. If we allow a processor to send a multicasting message to itself, then multicasting to its eight neighbours and itself, which is a group of G(0, 1, 9, 3), can be done in one cycle. Similarly, multicasting to neighbouring w × w processors, as mentioned at the beginning of this section, can also be done in one cycle. By mapping hierarchical multigrids or pyramid structures properly onto the logical 2-D structure, each processor can multicast to neighbouring processors or processors at the next level in one cycle.


4.2. Arbitrary multicasting patterns

As  discussed  above,  there  is  a  systematic  way  to  construct  a  SF  partition  for  any  regular 
group.  In  order  to  construct  a  maximal  SF  partition,  we  need  to  merge  subgroups  of  a  SF 
partition together as long as the newly merged subgroups can still be multicasted without
shadows.  Similar  partitioning  procedures  can  also  be  used  for  arbitrary  multicasting  patterns. 
The  proposed  partitioning  algorithm  presented  below  consists  of  two  parts.  The  first  part  is 
to  construct  a  SF  partition  and  the  second  part  is  to  merge  subgroups  to  construct  a  maximal 
SF  partition. 

For purposes of simplicity, we assume that three select pulses are used, namely row select pulses W1 and two diagonal select pulses W3 and W4. The first part of the algorithm partitions the whole group into at most ⌈n/2⌉ non-empty subgroups. This is done by putting processors of the group at ⌈n/2⌉ rows apart into one subgroup. Such a partition is a SF partition as stated in the following lemma.

Lemma 3. No shadows are possible when multicasting to a group of processors located at two rows which are at least ⌈n/2⌉ apart in an n × n array with three select pulses W1, W3 and W4.

Proof. (By contradiction.) Let the two rows be r_1 and r_2. According to the condition SC1 above, a shadow X must be at one of the two rows; assume it is row r_1. According to the shadow conditions SC3 and SC4 above, there must be two processors at row r_2 such that their respective 45° and −45° diagonal lines intersect at X. Therefore, the distance between these two processors should be two times the distance between the two rows r_1 and r_2, that is, at least 2 × ⌈n/2⌉, which is no less than n. Since these two processors are at the same row r_2, their distance could never exceed n − 1; hence no shadows are possible. ■
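
The first part of the algorithm can be sketched as follows. This is a minimal illustration, assuming row-major processor numbering (row = p div n) and interpreting "⌈n/2⌉ rows apart" as grouping rows by their index modulo ⌈n/2⌉; the function name is hypothetical.

```python
from collections import defaultdict


def initial_sf_partition(group, n):
    """Split a multicast group into at most ceil(n/2) subgroups.

    Rows r and r + ceil(n/2) fall into the same subgroup, so each subgroup
    spans at most two rows that are ceil(n/2) apart (the case of Lemma 3).
    """
    gap = (n + 1) // 2  # ceil(n / 2) using integer arithmetic
    subgroups = defaultdict(list)
    for p in sorted(group):
        row = p // n  # row-major numbering of the n x n array
        subgroups[row % gap].append(p)
    return [subgroups[key] for key in sorted(subgroups)]


# Five processors in a 5 x 5 array, lying in rows 0, 1, 2, 3 and 4.
parts = initial_sf_partition([0, 7, 13, 24, 18], 5)
```

Here rows 0 and 3 share a subgroup, rows 1 and 4 share another, and row 2 stands alone, giving three subgroups for n = 5.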

The second part of the algorithm first computes a forbidden set for each subgroup of the above SF partition. A row is said to be forbidden by a subgroup S_i if multicasting to S_i and processors of the group G located at that row will create at least one shadow. A forbidden set for a subgroup S_i is a set of rows forbidden by S_i. Two subgroups are merged together if neither contains processors at a row which is forbidden by the other. The new forbidden set for the merged subgroup is computed by adding certain rows into the union of the two original forbidden sets. When no further merges are possible, the algorithm stops and a maximal SF partition is constructed.

The time complexity of the algorithm is O(n + m²) where m is the number of processors in a group. This is because the first part of the algorithm takes O(n) time. Careful analysis shows that the second part of the algorithm takes O(m²) time.

It is clear that no shadow is possible in a group of less than three processors when three select pulses are used. Therefore, any maximal SF partition will always have no more than ⌈m/2⌉ subgroups. Hence the algorithm will generate at most min(⌈n/2⌉, ⌈m/2⌉) subgroups.

The partitioning algorithm can be executed incrementally when the multicasting pattern changes. Adding or deleting a processor from a group to be multicasted involves possibly splitting an existing subgroup, forming a new subgroup which contains processors at the same row and finally re-merging any subgroups which have since been changed.

If the column select pulses W2 are used instead of the row select pulses W1, the algorithm above can be adapted accordingly by partitioning columns which are ⌈n/2⌉ columns apart into subgroups and merging them into a maximal SF partition. If all four select pulses mentioned above are used, we can start with either SF partition, augment the condition which determines if a row (or a column) is forbidden by a certain subgroup, and merge subgroups together to achieve a maximal SF partition.

Figure 9 shows the simulation results on the number of subgroups generated by the algorithm. For a 20 × 20 processor system in Figure 9(a), when all 400 processors are multicasted, there are no unintended processors at all and that is why the number of subgroups is reduced to 1. If the number of subgroups in a maximal SF partition of a group G is g, then the time needed to transmit g address frames, one in each cycle, is g × (2 × n − 1) × Ch in the two-level addressing implementation. However, N × Ch is needed in a unary addressing implementation. Therefore, the speed-up is at least n/(2 × g). Noting that g cannot exceed n/2, the worst-case speed-up is 1. As shown in Figure 9(b), with three select pulses, the average number of subgroups needed to multicast to 50 processors in a 50 × 50 processor system is only about 5. Thus, a speed-up of 5 is obtained.
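
The speed-up figures quoted above follow from simple arithmetic. The sketch below assumes that the unary frame occupies N = n² slots, that each of the g two-level frames occupies 2n − 1 slots, and that both use the same per-slot time, so that this factor cancels; the function names are illustrative.

```python
def speedup_lower_bound(n: int, g: int) -> float:
    """Lower bound n / (2g) on the speed-up stated in the text."""
    return n / (2 * g)


def speedup_ratio(n: int, g: int) -> float:
    """Ratio of unary frame time to g two-level frames,
    n*n / (g * (2n - 1)), assuming equal per-slot time."""
    return (n * n) / (g * (2 * n - 1))


# 50 x 50 system with about 5 subgroups, as in Figure 9(b).
bound_50 = speedup_lower_bound(50, 5)
# Worst case g = n / 2 for a 20 x 20 system.
bound_worst = speedup_lower_bound(20, 10)
```

The ratio n²/(g(2n − 1)) is always slightly above n/(2g), which is why the text states the speed-up as "at least" n/(2 × g).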


5.  CONCLUSION 

In  this  paper,  coincident  pulse  techniques  have  been  applied  as  an  efficient  addressing 
mechanism  for  multicasting  among  multiprocessors  connected  by  optical  buses.  Two  basic 
models  of  a  unary  addressing  implementation  have  been  discussed,  and  a  two-level  addressing 
implementation  has  been  proposed  to  reduce  the  address  frame  length.  Two  approaches  to 
deal  with  the  shadow  problem  have  been  presented.  One  approach  reduces  the  number  of 
shadows by using check pulses. Another approach avoids possible shadows by constructing SF partitions. It has been shown that for regular multicasting patterns, SF partitions can be constructed systematically and processors can multicast to their communicating processors within one cycle in many applications. A partitioning algorithm has also been presented for arbitrary multicasting patterns. The overall results of the two-level addressing implementation are higher efficiency, lower minimum optical path requirements and potential speed-ups.

Figure 9. Simulation results of the partitioning algorithm (axis label: number of subgroups). (b) Number of subgroups for a group of m = 50 processors in an n × n system.


This  reinforces  our  belief  that  coincident  pulse  techniques  are  a  promising  addressing 
mechanism  which  can  be  applied  in  both  parallel  memory  structures  and  multiprocessor 
systems. Finally, we note that in this paper, many technical aspects of pulse generation,
coincidence  detection,  power  distribution  and  other  related  issues  have  not  been  discussed. 
They  can  be  found  in  References  5,  9  and  12. 


APPENDIX 

As  mentioned  in  Section  3.1,  the  amount  of  delay  that  is  added  on  the  row  select  and  the 
column  select  waveguides  can  be  obtained  by  solving  a  set  of  underconstrained  equations. 
In the following discussion, a folded bus model of N = n² processors is assumed and some of the previously defined notation is used. Additional notation is defined as follows.

Row(j), Col(j):

the number of delay loops added on the receiving segment of the row select and the column select waveguides, respectively, between processor j − 1 and processor j, where 0 < j < N.




D(L_i^1), D(L_i^2):

the transmission time of the row select pulse L_i^1 and the column select pulse L_i^2, respectively, relative to the transmission time of the reference pulse, where 0 ≤ i < n.

Given that there is one delay loop between any two adjacent processors on the receiving segment of the reference waveguide, we have the following set of equations for the row select waveguide:

\[
T(L_i^1) + \sum_{j=1}^{k} \mathrm{Row}(j)
\begin{cases}
= T_{ref} + k, & \text{if } i \times n \le k < (i+1)n \\
\ne T_{ref} + k, & \text{otherwise}
\end{cases}
\tag{A.1a}
\]

Note that the number of delay loops added on the reference waveguide has been fixed; therefore a degree of freedom has been removed already. Equation (A.1a) states that the pulse L_i^1 should coincide with the reference pulse at all processors at row i but should not coincide with the reference pulse at any other processors. Since T(L_i^1) − T_ref = D(L_i^1), we can simplify the above equation to

\[
D(L_i^1) + \sum_{j=1}^{k} \mathrm{Row}(j)
\begin{cases}
= k, & \text{if } i \times n \le k < (i+1)n \\
\ne k, & \text{otherwise}
\end{cases}
\tag{A.1b}
\]


Clearly, two different pulses cannot be transmitted at the same time; therefore, the following equation has to be satisfied also:

\[
\forall\, i \ne j, \quad D(L_i^1) \ne D(L_j^1).
\tag{A.1c}
\]


These two equations, namely (A.1a) and (A.1b), are underconstrained since both the D(L_i^1)s and the Row(j)s are variables. Note that the values of the D(L_i^1)s will determine the address frame length, and therefore we choose to fix them first and solve the equations for the Row(j)s. If the D(L_i^1)s are fixed such that D(L_i^1) = i, we can solve the above equations to get

\[
\mathrm{Row}(j) =
\begin{cases}
0, & \text{if } j = i \times n \text{ for some } 0 < i < n \\
1, & \text{otherwise}
\end{cases}
\]
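
The delay assignment can be verified numerically for small n. The sketch below assumes D(L_i^1) = i and the solution Row(j) = 0 when j is a positive multiple of n and 1 otherwise, and checks the coincidence condition D(L_i^1) + Row(1) + ... + Row(k) = k exactly for the processors k in row i. The function names are illustrative.

```python
def row_delay(j: int, n: int) -> int:
    """Delay loops between processors j - 1 and j on the row select waveguide."""
    return 0 if j % n == 0 else 1


def coincides(i: int, k: int, n: int) -> bool:
    """True if row select pulse i meets the reference pulse at processor k,
    with the relative transmission time fixed as D = i."""
    accumulated = sum(row_delay(j, n) for j in range(1, k + 1))
    return i + accumulated == k


def solution_is_valid(n: int) -> bool:
    """Coincidence must occur at processor k exactly when k lies in row i."""
    return all(
        coincides(i, k, n) == (i * n <= k < (i + 1) * n)
        for i in range(n)
        for k in range(n * n)
    )
```

The check works because the accumulated delay up to processor k equals k minus the number of row boundaries crossed, so the pulse sent i units late lines up with the reference pulse only in row i.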


Note  that,  if  the  D  (Ll)i  are  fixed  such  that  D  (Ll)  =  we  will  get  the  same  result  as  in 
Section 3.

Similarly, we can have the following set of equations for the column select waveguide:

\[
D(L_i^2) + \sum_{j=1}^{k} \mathrm{Col}(j)
\begin{cases}
= k, & \text{if } k = i + m \times n, \text{ where } 0 \le m < n \\
\ne k, & \text{otherwise}
\end{cases}
\tag{A.2a}
\]

\[
\forall\, i \ne j, \quad D(L_i^2) \ne D(L_j^2).
\tag{A.2b}
\]

It can be shown that by fixing D(L_i^2) = i, we can get the result as in Section 3. Equations and their solutions for check waveguides can be similarly constructed.


Pipelined Communications in Optically Interconnected Arrays*

Two synchronous multiprocessor architectures based on pipelined optical bus interconnections are presented. The first is a linear pipeline with enhanced control strategies which make optimal use of the available communication bandwidth of the optical bus. The second is a two-dimensional architecture in which processors are placed in a square grid and interconnected to one another through horizontal and vertical pipelined optical buses. These architectures allow any two processors to communicate with each other using one (for the linear case) or two (for the two-dimensional case) pipelined bus cycles. Further, they permit all processors to have simultaneous access to the buses using slots within a pipelined cycle. We show that the architectures have simple control structures and that well-known processor interconnections, e.g., the complete binary trees and the hypercube networks, can be efficiently embedded in them. These architectures have an effectively higher bandwidth than conventional bus configurations and appear to be good candidates for a new generation of hybrid optical-electronic parallel computers. © 1991 Academic Press, Inc.


1. INTRODUCTION

Two-dimensional meshes of processors have been extensively studied in various forms and augmentations [23, 26, 37]. Large-scale implementations of two-dimensional meshes have been built [2, 10, 17]. However, since the communication diameter of an n × n mesh is O(n), different approaches have been considered to augment the communication capabilities of the mesh to reduce this diameter. Meshes have been augmented with global buses [3, 10, 11, 35], reducing the communication diameter but giving only very small bandwidth improvements. Row and column bus augmentations [29, 30] have yielded both a low communication diameter and adequate bandwidth for certain classes of algorithms. Interconnection networks have been considered for augmenting rows and columns in a mesh, including trees [27, 28, 39] and compounded graphs [18, 19]. The binary hypercube can also be viewed in this context as a two-dimensional mesh with horizontal and vertical hypercube interconnections [18, 19].


* This work was, in part, supported by Air Force Grant AFOSR-89-0469 and by NSF Grant MIP-8901053.


One of the simplest mesh augmentation schemes is the row and column bus augmentation. However, exclusive write access to buses is a major contributor to the low bandwidth of bus interconnections. A unique property of optics provides an alternative to this exclusive access, namely, the ability in optics to pipeline the transmission of signals through a channel. In electronic buses, signals propagate in both directions from the source, while optical channels are inherently directional and have precise predictable path delays per unit distance. Hence, a pipeline of optical signals may be created by the synchronized directional coupling of each signal at specified locations along the channel. This property has been used to parallelize access to shared memory [5], to enhance the bandwidth in bus-connected multiprocessor systems [22], and to minimize the control overhead in networking environments [38].

In this paper, we present two multiprocessor architectures, called Array Processors with Pipelined Buses (APPB), which employ optical bus interconnections in processor arrays. In Section 2 we review the basic principle of pipelining messages on optical buses. In Section 3 we introduce our linear APPB, where processors are connected with a single optical bus. We present efficient approaches to message routing and network embedding for the linear APPB as well as techniques for enhancing the bus utilization through enhanced control functions. In Section 4 we introduce our two-dimensional APPB, where processors are interconnected with horizontal and vertical optical buses. We discuss routing and embedding issues for this new architecture. We show how binary tree and hypercube interconnections can be effectively embedded and identify key design issues for effective embeddings of arbitrary interconnections. In Section 5 we compare the efficiency of the pipelined bus communication model with that of nonpipelined buses and of store and forward communications in nearest-neighbor structures. Finally, Section 6 contains concluding remarks.

2.  MESSAGE  PIPELINING  ON  OPTICAL  BUSES 

Consider the system of Fig. 1a, where n processors, each having a constant number of registers, are connected through a single optical waveguide (bus). Each processor is coupled to the optical waveguide with two passive couplers, one for injecting (writing) signals on the waveguide and the other for receiving (reading) signals from the waveguide [20, 40]. Each receiving coupler passively taps a percentage (typically 5-10%, depending on the coupling ratio) of the optical signal power available on the bus. Thus the couplers do not introduce any delay to the propagation of optical signals along the bus. However, the degradation of signal power does place an upper limit on the number of processors that can be connected on the bus [8]. As in the case of electronic buses, each processor j communicates with any other processor i by sending a message to i through the common bus. However, because optical signals propagate in one direction, a processor j may send signals to another processor i only if i > j.

FIG. 1. (a) A system of n processors connected with a single optical waveguide (bus). (b) A linear array of n processors with nearest-neighbor connections.

Assume that a message on an optical bus consists of a sequence of optical pulses, each having a width w in seconds. The existence of an optical signal of width w represents a binary bit 1, and the absence of such a signal represents a 0. Note that w includes a time for electro-optical conversions, rise and fall times, and propagation delay in the latch of the receiver circuits [6]. For analytical convenience, we let D_0 be the optical distance between each pair of adjacent nodes (it will become clear that the distances between adjacent nodes need not be equal) and τ be the time taken for an optical pulse to traverse the optical distance D_0. To transfer a message from a node j to node i, i > j, the sender j writes its message on the bus. After a time (i − j)τ the message will arrive at the receiver i, which then reads the message from the bus.

The properties of unidirectional propagation and predictable path delays of optical signals may be used advantageously. Specifically, unlike the electronic case, where the writing access to the bus by each node must be mutually exclusive, all nodes in the system of Fig. 1a can write on the bus simultaneously, provided that the following collision-free condition [22] is satisfied:

D_0 > b w c_g,    (1)

where b is the number of binary bits in each message, and c_g is the velocity of light in the waveguide. Clearly, if this condition is satisfied and the system is synchronized such that every node starts writing a message on the bus at the same instant, then no two messages injected on the bus by any two distinct nodes will collide. Here by colliding we mean that two optical signals injected on the bus by any two distinct nodes arrive at some point on the bus simultaneously. This kind of synchronized pulse generation is restrictive but it can be met in several ways [21]. An optically distributed clock can be broadcast without skew to each node, or electro-optical switches can be used in place of sources to "switch in" pulses generated from a single source. With this condition satisfied, every node can, in parallel, send a message to some other node, and the messages will all travel from left to right on the bus in a pipelined fashion, as shown in Fig. 2. Thus we use the term pipelined bus. In the rest of this paper we always assume that the collision-free condition (1) is satisfied.
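
The collision-free condition (1) gives a minimum spacing between adjacent nodes. The sketch below evaluates it for illustrative numbers only: a 32-bit message, 1 ns pulses and a propagation velocity of 2 × 10^8 m/s are assumptions chosen for the example, not values from the text, and the function names are hypothetical.

```python
def min_node_spacing(bits: int, pulse_width_s: float, velocity_m_per_s: float) -> float:
    """Length of waveguide occupied by one message: bits * pulse width * velocity.
    The optical distance between adjacent nodes must exceed this value."""
    return bits * pulse_width_s * velocity_m_per_s


def is_collision_free(spacing_m: float, bits: int, pulse_width_s: float,
                      velocity_m_per_s: float) -> bool:
    """Condition (1): spacing strictly greater than the message length."""
    return spacing_m > min_node_spacing(bits, pulse_width_s, velocity_m_per_s)


# Assumed example: 32 bits, 1 ns per pulse, 2e8 m/s in the waveguide.
required = min_node_spacing(32, 1e-9, 2e8)
```

With these assumed values a message occupies about 6.4 m of waveguide, so shorter messages or narrower pulses are needed if the nodes are to sit closer together.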

To facilitate our discussion in subsequent sections we define some terms. Let τ be defined as before and n be the number of nodes on the pipelined optical bus. We define nτ as a bus cycle and correspondingly τ as a petit cycle. Note that a bus cycle is the time taken for an optical signal to traverse the entire length of the optical bus. For the discussion in this section, we do not include in a bus cycle the time taken to prepare and process a message before it can be injected on the bus. This time is explicitly introduced in our performance analysis in Section 5. If every node is writing a message simultaneously on the bus, then each node has to wait for at least a bus cycle to inject its next message. Note that each cycle on the pipelined bus may be emulated by n cycles in a linear array with nearest-neighbor communications shown in Fig. 1b. Comparison of the two interconnection schemes is made in Section 5.

Let us look at a simple routing task where each node transmits a message and each node is programmed to receive a message from the kth node (if it exists) to its left. All nodes start injecting messages at the beginning of a bus cycle, and all the messages travel on the optical bus in pipelined fashion without collision. By waiting for a specific interval of time, a node can selectively read the message intended for it as that message passes by the node. In our example, each node i is to receive a message from node i − k and thus must read its message from the bus after kτ time from the beginning of the bus cycle. In this way, a message routing pattern in which each node sends a message to the kth node to its right has been realized. In fact, as will be seen, we can realize various message routing patterns in a simple, straightforward way.
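
The routing task just described can be mimicked with a small discrete-time simulation. This is a sketch under simplifying assumptions: time advances in whole petit cycles, the message injected by node j is passing node j + t after t petit cycles, and every node reads the bus only at t = k. The function name is illustrative.

```python
def route_kth_right(messages, k):
    """Simulate one bus cycle in which node i reads after waiting k petit cycles.

    messages[j] is the message injected by node j at the start of the cycle.
    Returns the list of messages read by each node (None if nothing was read).
    """
    n = len(messages)
    received = [None] * n
    for t in range(1, n):          # petit cycles elapsed since injection
        for j in range(n):         # message injected by node j
            i = j + t              # node that this message is passing now
            if i < n and t == k:   # node i reads only at time k * tau
                received[i] = messages[j]
    return received


result = route_kth_right(list("abcde"), 2)
```

In this run nodes 0 and 1 receive nothing because they have no second node to their left, while nodes 2, 3 and 4 receive the messages injected by nodes 0, 1 and 2.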

3.  LINEAR  ARRAY  PROCESSORS 
WITH  PIPELINED  BUSES 

In the system of Fig. 1a, messages can be transmitted only from left to right. To allow message passing from right to left, another optical bus is used, as shown in Fig. 3a. In this figure, we have two optical buses; the upper one is used for sending messages from left to right, and the lower one is used for sending messages from right to left. Each node can write and read messages on either bus as desired. Obviously signals in different buses do not disturb one another; that is, the two buses can support two separate pipelines. The system in Fig. 3a is our architecture of linear APPB. For convenience the linear APPB in Fig. 3a is schematically drawn as in Fig. 3b.

FIG. 2. Message pipelining on the optical bus. A blank rectangle indicates "no signal," implying that some processor is not sending a message.

To specify the time at which a node should receive a message, we introduce a control function twait(i), which is defined as the time that node i should wait, relative to the beginning of the bus cycle, before reading the message sent to it from some other node j. Thus

twait(i) = (i − j)τ.

If τ is considered as a time unit, then twait can be interpreted in terms of the number of such time units and thus be written twait(i) = i − j. Clearly if twait(i) > 0, then the message is to be received from the left; if twait(i) < 0, then the message is to be received from the right. If twait(i) = 0, then no message should be received by node i. The value of twait(i) can be stored in a wait register, and more than one such register may be used if a node is to receive more than one message in one bus cycle.

This $t_{wait}$ control function, however, has the disadvantages
that it depends crucially on timing accuracy and is
sensitive to the optical distance between two adjacent
nodes. An equivalent control function, $m_{wait}$, that does not
have these disadvantages may be defined if we require that
each node inject a message, real or dummy, every bus cycle.
In this case we define $m_{wait}(i)$ as the number of messages
that node $i$ should skip before reading its message. For example,
if $m_{wait}(i) = \gamma$, then node $i$ should receive the $|\gamma|$th
message that passes $i$ on the bus. That is, it has to wait until
$|\gamma| - 1$ messages have passed and then it reads its own
message. The sign of $\gamma$ determines on which bus the message
should be received. Clearly $m_{wait}$ is equivalent to $t_{wait}$ and
either control function may be used. For convenience we
simply write the control function as $wait$, and we assume
that the optical distance between each pair of adjacent nodes
$i$ and $i + 1$ is constant.
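The equivalence of the two control functions can be checked with a small model. The sketch below is illustrative only: it assumes a single left-to-right bus on which every node injects one message at time 0 and messages advance one node per petit cycle; the function names are hypothetical and do not come from the paper.

```python
def t_wait(i, j):
    """Petit cycles node i waits for the message from node j.
    A positive value means the message arrives from the left."""
    return i - j


def sender_seen_at(i, t):
    """On the left-to-right bus, the message passing node i at time t >= 1
    was injected at time 0 by node i - t (None if no such node exists)."""
    src = i - t
    return src if src >= 0 else None


def m_wait_sender(i, m):
    """Sender of the |m|-th message that passes node i on the left-to-right
    bus, assuming every node injects exactly one message per bus cycle."""
    passed = [sender_seen_at(i, t) for t in range(1, i + 1)]
    return passed[abs(m) - 1]
```

For node 5 receiving from node 2, both views select the same message: waiting 3 petit cycles, or taking the third message that passes.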

The  control  function  wait  can  only  be  used  when  the 
communication  pattern  is  known  to  the  receiver  in  the  sense 
that  the  receiver  knows  from  which  node  the  message  is  to 
be  received.  In  cases  where  the  communication  pattern  is 
unknown to the receiver, the coincident pulse techniques
[5, 21] may be used such that an addressing pulse and a
reference  pulse  coincide  at  the  detector  of  the  receiver, 
thereby  addressing  it.  In  this  paper  we  use  wait  for  addressing 
since  the  communication  patterns  which  we  discuss  are 
known  to  the  receiver. 

In the following we present techniques for message routing
and network embedding in the linear APPB. For the purpose
of evaluating the communication efficiency, we note that a
lower bound on the number of bus cycles needed to transfer
$H$ messages in the linear APPB is $\lceil H/n \rceil$, where $n$ is the
number of nodes on the optical bus. This lower bound is
obtained by assuming a perfectly even distribution of messages
along the bus at each bus cycle, that is, every node has
one message to send at each bus cycle.

3.1.  Message  Routing  in  Linear  APPB 

Various message routing patterns can be realized in a simple,
straightforward way. Since a routing pattern is determined
by the $wait$ functions, we need only determine these
$wait$ functions for each routing pattern. The most common
patterns are:

One-to-One. The system executes a SEND(j, i) instruction,
which means that a message is to be transferred from
node $j$ to node $i$. Thus, $wait(i) = i - j$, where $i$ is a single
specific node.

Broadcast. The system executes BROADCAST(j),
which means that node $j$ broadcasts a message, and all other
nodes $i$ will receive that message. In this case, $wait(i) = i - j$
for all $i \ne j$.

Semigroup Communication [4]. The system executes a
SEMIGROUP(i) instruction, which says that some global
information, e.g., extrema and sum, is to be computed and
stored at node $i$. This task can be accomplished by having
the linear APPB logically function as a tree with the root
being node $i$. Later in this section we present embeddings
of binary trees which facilitate such a tree emulation task.

Permutations. For each node $j$ to send a message to a
node $i = PERM(j)$, where PERM() is an arbitrary permutation,
we set $wait(i) = i - j$ for all $i$.

We see that the computation of $wait(i)$ is very simple and
uniform. The only difference among the $wait$ functions for
different message routing patterns is that the nodes involved
are different. It is clear that all these communication tasks
can be performed using a single bus cycle, except the semigroup
communication, which takes $\log(n)$ bus cycles. Note
that, in the linear APPB, message passing between two
nonneighboring nodes is nearly as efficient as that between two
neighbors. Specifically, a message takes $\tau$ more time to pass
one more node on the optical bus. This is not the case in
the linear array with nearest-neighbor connections shown in
Fig. 1b, where to pass a node, en route to another node, a
message has to go through a router. In this sense we may
say that the APPB is communication efficient, and in particular
global-communication efficient.
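The single-cycle patterns above reduce to filling in one $wait$ value per receiving node. A minimal sketch, assuming nodes are numbered from 0 and using hypothetical function names, might look as follows. Each function returns a dictionary from receiver to its $wait$ value; a value of 0 means the node reads nothing.

```python
def one_to_one(j, i):
    """wait values for SEND(j, i): only node i listens."""
    return {i: i - j}


def broadcast(n, j):
    """wait values for BROADCAST(j) on an n-node bus: every node but j listens."""
    return {i: i - j for i in range(n) if i != j}


def permutation(perm):
    """wait values when node j sends to node perm[j] (perm is a list)."""
    return {i: i - j for j, i in enumerate(perm)}
```

The sign of each value selects the bus: positive values are read from the left-to-right bus, negative values from the right-to-left bus.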


3.2. Embedding Binary Tree and Hypercube Networks into Linear APPB

In this subsection we show how to embed other interconnection
networks into the linear APPB. Our first example is
the embedding of complete binary tree networks. To show
that a binary tree network can be embedded in the linear
APPB it is sufficient to find the $wait$ function for each processor
in the linear APPB such that the desired routing pattern
is accomplished.

Let $L$ be the number of levels of a complete binary tree
and let the root of the tree be node 1. Each node $i$, $i \ge 1$,
which is not a leaf node has two children, $2i + b$, where
$b = 0, 1$, corresponding to $i$'s left and right child, respectively
(see Fig. 4a for an example). Consider an embedding in
which node $i$ in the tree is mapped to node $i - 1$ in the linear
APPB. For convenience, we call this embedding $E_{t1}$ (see Fig.
4b). In $E_{t1}$, the wait functions for node $i$ to receive a message
from its children are:

$$wait_{c,b}(i) = \begin{cases} i - (2i + b) = -(i + b), & i < 2^{L-1}, \\ 0, & \text{otherwise.} \end{cases}$$

Thus, to realize children-to-parent message routing each
parent should wait for $wait_{c,0}(i)$ and $wait_{c,1}(i)$ time to read
the messages from its left and right child, respectively. Clearly
this routing task can be performed using one bus cycle.

For parent-to-children message transfer in $E_{t1}$, each parent
has two messages to send to its two children, respectively.
In this case, two bus cycles are needed to carry out such a
routing task, one to send messages to left children and one
to send messages to right children. Let $wait_{p,0}(j)$ and
$wait_{p,1}(j)$ be the wait functions for a left child and right child,
respectively, to receive a message from its parent. Then, during
the first cycle we have

$$wait_{p,0}(j) = \begin{cases} j - \dfrac{j}{2} = \dfrac{j}{2}, & j \text{ even}, \\ 0, & \text{otherwise,} \end{cases}$$

and during the second cycle we have

$$wait_{p,1}(j) = \begin{cases} j - \dfrac{j-1}{2} = \dfrac{j+1}{2}, & j \text{ odd and } j \ne 1, \\ 0, & \text{otherwise.} \end{cases}$$
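The wait functions for $E_{t1}$ can be cross-checked against the node positions. The following sketch assumes heap numbering of the tree (root 1, children $2i$ and $2i + 1$) and the mapping of tree node $i$ to bus position $i - 1$; the function names are illustrative.

```python
def pos_t1(i):
    """Bus position of tree node i under E_t1 (tree node i -> APPB node i - 1)."""
    return i - 1


def wait_c(i, b, L):
    """Parent i waits this long for child 2i + b (0 if i is a leaf)."""
    return -(i + b) if i < 2 ** (L - 1) else 0


def wait_p0(j):
    """First cycle: left children (even j) receive from their parent."""
    return j // 2 if j % 2 == 0 else 0


def wait_p1(j):
    """Second cycle: right children (odd j, j != 1) receive from their parent."""
    return (j + 1) // 2 if j % 2 == 1 and j != 1 else 0
```

In each case the value equals the receiver's position minus the sender's position, which is the general rule $wait(i) = i - j$ applied to the embedded nodes.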


Mapping each node $i$ in the binary tree network onto node
$i$ (or $i - 1$, as was just done above) in the linear APPB is a
straightforward approach. Using this straightforward approach
we can embed any type of network in the linear
APPB. This approach, however, may not give a good
embedding in the sense that it may take more time than
needed, in number of bus cycles, to accomplish a given communication
task. As is seen next, another tree embedding,
$E_{t2}$, has a better communication efficiency than $E_{t1}$.

Embedding $E_{t2}$ may be viewed as pressing the binary tree
from the root down until all the nodes fall in the level of the
leaf nodes (see Fig. 4c). In this embedding the two children
of a node $i$ are on opposing sides of $i$. Thus the parent-to-children
routing pattern, as well as the children-to-parent
routing pattern, may be accomplished in one bus cycle. Specifically,
if $i$ is a node at level $l$, where $l$ is the integer satisfying
$2^l - 1 < i < 2^{l+1}$, then the wait functions for $i$ to receive
the messages from its two children are

$$wait_{c,b}(i) = \begin{cases} (-1)^b \, 2^{L-l-2}, & i < 2^{L-1}, \\ 0, & \text{otherwise.} \end{cases}$$

FIG. 4. Embeddings of complete binary trees in the linear APPB. (a)
A binary tree. (b) The first embedding, $E_{t1}$. (c) The second embedding, $E_{t2}$.

The parent-to-children message routing pattern in $E_{t2}$ is
different from that in $E_{t1}$ in that the two messages from a
parent will travel on two different buses. Then the two messages
from each parent node can be simultaneously injected
on the two buses, respectively, in the same bus cycle. Hence,
the parent-to-children routing pattern can be accomplished
in one bus cycle. $wait_{p,b}$ can be determined by noting that
$wait_{p,b}(j) = -wait_{c,b}(i)$, where $i$ is the parent of $j$. That is,
the wait functions for parent-to-children message transfer
are

$$wait_{p,b}(j) = \begin{cases} (-1)^{b+1} \, 2^{L-l-2}, & j > 1, \\ 0, & j = 1, \end{cases}$$

with $j = 2i + b$ and $l$ the level of the parent $i$.
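One way to make the "pressed down" embedding concrete is to read it as an in-order layout of the tree, which places the two children on opposite sides of their parent. That reading is an assumption made for this sketch, as are the function names; the closed form is then checked against positions obtained by an explicit in-order traversal.

```python
def inorder_positions(L):
    """Map each node of an L-level complete binary tree (heap numbering,
    root = 1) to its in-order rank, used here as its bus position."""
    pos, next_slot = {}, 0

    def visit(i):
        nonlocal next_slot
        if i >= 2 ** L:          # past the last level
            return
        visit(2 * i)             # left subtree first
        pos[i] = next_slot
        next_slot += 1
        visit(2 * i + 1)         # then right subtree

    visit(1)
    return pos


def wait_c_t2(i, b, L):
    """Closed form: parent i at level l waits (-1)^b * 2^(L-l-2) for child 2i + b."""
    if i >= 2 ** (L - 1):        # leaves have no children
        return 0
    l = i.bit_length() - 1       # level l satisfies 2^l <= i < 2^(l+1)
    return (-1) ** b * 2 ** (L - l - 2)
```

Under this layout the left child lies to the left of its parent, so its message arrives on the left-to-right bus (positive wait), and the right child's message arrives on the other bus.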

Next, we consider a $k$-dimensional binary hypercube in
which the nodes are numbered such that if nodes $i$ and $j$ are
neighbors across dimension $h$, $1 \le h \le k$, then $|i - j| = 2^{h-1}$
(see Fig. 5a). Let $E_{c1}$ be the embedding of this $k$-cube into
a linear APPB such that each node $i$ in the hypercube is
mapped into node $i$ in the linear APPB. With this embedding,
a node in the hypercube may send a distinct message to each
of its $k$ neighbors if each node sends one message to one
neighbor in each bus cycle. For example, at the $h$th bus cycle
a message is sent from each node to its neighbor at distance
$2^{h-1}$. To accomplish this, the time that a node $i$ has to wait
during the $h$th bus cycle before receiving a message from its
neighbor along the $h$th dimension is

$$wait_h(i) = \pm 2^{h-1}.$$
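The sign in $wait_h(i)$ depends on which side the neighbor lies. A short sketch, assuming the usual binary labelling in which dimension-$h$ neighbors differ in bit $h - 1$ (one numbering that satisfies $|i - j| = 2^{h-1}$), with hypothetical function names:

```python
def neighbor(i, h):
    """Neighbor of node i across dimension h (1-based): flip bit h - 1."""
    return i ^ (1 << (h - 1))


def wait_h(i, h):
    """Wait of node i in bus cycle h: i - j, where j is its dimension-h neighbor."""
    return i - neighbor(i, h)
```

For example, node 5 in dimension 2 has neighbor 7 and therefore reads from the right-to-left bus, while node 7 reads node 5's message from the left-to-right bus.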

In  our  discussions  so  far,  we  have  allowed  each  node  to 
send  only  one  message  on  each  bus  during  each  bus  cycle. 
In  other  words  after  placing  a  message  on  the  bus  in  the 
current  cycle,  all  nodes  must  wait  until  the  next  cycle  to 
initiate  the  next  message.  In  the  following  subsection,  we 
show  that  such  a  wait  is  not  always  necessary. 


FIG. 5. (a) A binary hypercube and its dimension assignment. (b) Message routing patterns in the hypercube. (c) Message distribution in the
hypercube.

3.3. Interleaved and Overlapped Pipelining

Up until now, we have required that each node send only
one message on each bus in one bus cycle and that the transmission
of messages be initiated at the beginning of a bus
cycle. Given these two restrictions, no specific control function
was needed for the initiation of messages. However, if
some node does not have a message to send during a bus
cycle, an empty slot of one petit cycle in duration will be created.
Interleaved pipelining is a technique which tries to fully utilize
the communication capacity of the pipelined bus by inserting
a message into any available slot. This may be accomplished
if a node is allowed to place more than one message
on the same bus within a bus cycle, but at different petit
cycles. To allow for this flexibility, a control function $send_q(j)$
must be used to specify the time, relative to the beginning
of a bus cycle, at which node $j$ should write its $q$th message
on the bus.

To show how interleaved pipelining works, let us now
examine the routing patterns in $E_{c1}$. Since message transfers
in opposite directions on the two buses of the linear APPB
form two separate and symmetric pipelines, we need to look
at only one direction. Consider the left-to-right message
transfer in $E_{c1}$, and define $k$ sets,
$S_h = \{ j \mid 0 \le j < n,\ 0 \le (j \bmod 2^h) < 2^{h-1} \}$, $1 \le h \le k$,
of nodes for the $k$-cube.

That is, $S_h$ is obtained by partitioning the $n$ nodes of the
hypercube into $2^h$-node groups and including in $S_h$ the first
$2^{h-1}$ nodes in each group. For example, for the 4-cube in
Fig. 5a, we have $S_1 = \{0, 2, 4, 6, 8, 10, 12, 14\}$,
$S_2 = \{0, 1, 4, 5, 8, 9, 12, 13\}$, $S_3 = \{0, 1, 2, 3, 8, 9, 10, 11\}$, and
$S_4 = \{0, 1, 2, 3, 4, 5, 6, 7\}$. Note that all the $k$ sets, $S_h$, have
the same cardinality $2^{k-1}$, and each contains node 0. Hence,
in the realization of the binary $k$-cube using a linear APPB,
there are $k$ routing patterns. In the $h$th pattern, $1 \le h \le k$,
the nodes in set $S_h$ send messages to their neighbors along
the $h$th dimension in the hypercube, as indicated with the
arrowed curves in Fig. 5b. Correspondingly, the messages
can be divided into $k$ sets, $M_h$, $1 \le h \le k$, which are sent by
the $k$ sets of nodes $S_h$, respectively. For the routing patterns
in Fig. 5b, these message sets are shown in Fig. 5c.
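The sender sets are easy to enumerate directly from their definition. The sketch below uses a hypothetical helper name and reproduces the 4-cube example.

```python
def sender_set(h, k):
    """S_h for the k-cube: nodes j with 0 <= (j mod 2^h) < 2^(h-1)."""
    n = 2 ** k
    return [j for j in range(n) if j % 2 ** h < 2 ** (h - 1)]
```

Calling `sender_set(2, 4)` gives `[0, 1, 4, 5, 8, 9, 12, 13]`, and every set for the 4-cube has $2^{k-1} = 8$ members.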

Using interleaved pipelining, the messages in the two sets
$M_{2s-1}$ and $M_{2s}$, $1 \le s \le k/2$, are interleaved and sent in the
same bus cycle. Let $send_1(j)$ and $send_2(j)$ be the times at
which node $j$ writes its messages in $M_{2s-1}$ and $M_{2s}$, respectively,
on the bus during bus cycle $s$. Correspondingly, let
$wait_1(i)$ and $wait_2(i)$ be the wait functions for a node $i$ to
receive the messages in $M_{2s-1}$ and $M_{2s}$, respectively, during
bus cycle $s$. Then, for interleaved pipelining we have the
following send functions for a node $j$ at bus cycle $s$,
$1 \le s \le k/2$:

$$send_1(j) = 0, \quad j \in S_{2s-1},$$

$$send_2(j) = \begin{cases} 0, & j \in S_{2s} \text{ and } 2^{2s-2} \le (j \bmod 2^{2s}) < 2^{2s-1}, \\ 2^{2s-2}, & j \in S_{2s} \text{ and } 0 \le (j \bmod 2^{2s}) < 2^{2s-2}. \end{cases}$$

The corresponding wait functions are

$$wait_1(i) = i - j, \quad j \in S_{2s-1},$$

$$wait_2(i) = \begin{cases} i - j, & j \in S_{2s} \text{ and } 2^{2s-2} \le (j \bmod 2^{2s}) < 2^{2s-1}, \\ i - j + 2^{2s-2}, & j \in S_{2s} \text{ and } 0 \le (j \bmod 2^{2s}) < 2^{2s-2}. \end{cases}$$


A node for which the send or wait function is not defined
above should not send or receive any message. Note that the
times determined by these send and wait functions are with
respect to the beginning of each bus cycle $s$. Also note that
since the receiving node $i$ knows the id of the sending node
$j$ (since they are neighbors in the $k$-cube), it knows which
of the two values of $wait_2(i)$ should be used. As an example,
the interleaved pipelining for the messages in Fig. 5c is
achieved by interleaving message sets $M_1$ and $M_2$ in the first
bus cycle and $M_3$ and $M_4$ in the second bus cycle. The arrowed
lines in Fig. 5c show how the messages are being interleaved,
and the resulting message pipelines are shown in
Fig. 6a.
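A simple way to see that the interleaved messages never collide is to label each message by the slot it rides in, namely the sender's position minus its injection time, which stays constant as the message travels. The sketch below assumes the send functions $send_1(j) = 0$ for $j \in S_{2s-1}$ and $send_2(j) \in \{0, 2^{2s-2}\}$ for $j \in S_{2s}$; the slot model and function names are illustrative, not taken from the paper.

```python
def send2(j, s):
    """Injection time of node j's M_{2s} message in bus cycle s
    (None if j is not in S_{2s})."""
    r = j % 2 ** (2 * s)
    quarter = 2 ** (2 * s - 2)
    if quarter <= r < 2 * quarter:
        return 0                 # j is not in S_{2s-1}, so its time-0 slot is free
    if r < quarter:
        return quarter           # j also sends in M_{2s-1}; use a later slot
    return None


def wait2(i, j, s):
    """Time at which receiver i reads the M_{2s} message sent by node j."""
    return i - j + send2(j, s)


def slots_used(s, k):
    """Slots claimed in bus cycle s, identified by (node - injection time)."""
    n = 2 ** k
    # M_{2s-1}: senders in S_{2s-1}, all injected at time 0
    slots = [j for j in range(n) if j % 2 ** (2 * s - 1) < 2 ** (2 * s - 2)]
    # M_{2s}: senders in S_{2s}, injected at time send2(j, s)
    slots += [j - send2(j, s) for j in range(n) if send2(j, s) is not None]
    return slots
```

If all slot labels in a bus cycle are distinct, no two messages ever occupy the same place on the bus at the same time. Negative labels correspond to messages inserted in front of the pipeline.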

It can be seen that using interleaved message pipelining,
the total communication time taken for each node to send
a message to each of its neighbors is $k/2 + 1$ bus cycles,
where the last bus cycle is due to the time needed to clear
out the first $n/4$ messages (sent by nodes 0 through $n/4 - 1$)
in $M_k$ that were inserted in front of $M_{k-1}$. Compared with
$k$ bus cycles, the time needed if each node sends one message
per bus cycle, our savings in the communication time is
$k/2 - 1$ bus cycles. Although this savings is significant, there
are still unused slots from the rightmost nodes on the bus,
as can be seen from the message pipeline at time $t = 16$ in
Fig. 6a. We next show how to utilize these empty slots using
overlapped pipelining.

In overlapped pipelining, we pipeline the message pipelines
obtained from interleaved pipelining by allowing the messages
for bus cycle $s$ to be initiated before bus cycle $s - 1$
terminates, as long as message collision does not occur. For
this purpose we define a new control function, $step_s$, which
specifies the time, with respect to the beginning of the first
bus cycle, at which the messages for bus cycle $s$ are initiated.
Clearly, savings in communication time is possible if
$step_s - step_{s-1} < n\tau$. In this case, we avoid confusion by calling
the bus cycles message transfer steps.

In $E_{c1}$, the control function $step_s$, $1 \le s \le k/2$, specifies
when $S_{2s-1}$ and $S_{2s}$ should start sending their messages. Specifically,
let $step_1 = 0$ and let $step_s$, $1 < s \le k/2$, be the time
interval in number of petit cycles between the initiations of
steps 1 and $s$. Then, messages from step $s$ and step $s - 1$
will not collide if

$$step_s = step_{s-1} + n - \frac{3}{4} \, 2^{2s-2}, \quad 1 < s \le \frac{k}{2}.$$

FIG. 6. (a) Interleaved pipelining and (b) overlapped pipelining of messages in the 4-cube. ($t$ is measured in petit cycles.)


The send and wait functions defined in the previous subsection
are still applicable here, but they are now defined with
respect to the time determined by $step_s$, the beginning of
transfer step $s$, rather than the beginning of each bus cycle
$s$. Figure 6b shows the result of overlapped pipelining of the
message pipelines in Fig. 6a. Note that in interleaved pipelining
there was also some overlapping between the two message
pipelines generated in two consecutive bus cycles, as
can be seen from the message pipeline at time $t = 16$ in Fig.
6a. But, as has been mentioned previously, interleaved pipelining
does not fully utilize the pipelined bus.

These control functions $step$, $send$, and $wait$ together result
in a minimized total communication time. To show this we
first note that since the cardinality of $M_h$, $1 \le h \le k$, is
$n/2$, the total number of messages is $kn/2$. Thus, if we assume
that the message distribution over processors is perfectly even
in each bus cycle (every processor has a message to send in
each bus cycle), then the time needed for transferring these
messages is at least $\lceil (kn/2)/n \rceil = k/2$ bus cycles, or equivalently
$kn/2$ petit cycles. In our case, however, such an assumption
of even message distribution does not hold. For example,
no message can be inserted on the bus by processor $n - 1$ in
the first bus cycle, as can be seen from the message pipeline
at time $t = 0$ in Fig. 6b. Now we compute the total time, in
number of petit cycles, using the control functions determined
above. It can be shown that

$$step_{k/2} = \left( \frac{k}{2} - \frac{5}{4} \right) n + 1.$$

The time due to $send_2$ at step $k/2$ is $2^{k-2} = n/4$. Finally it
takes $n$ petit cycles for the bus to clear out. Therefore the
total time in number of petit cycles is

$$\left( \frac{k}{2} - \frac{5}{4} \right) n + 1 + \frac{n}{4} + n = \frac{k}{2} \, n + 1.$$
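The total can be checked numerically by accumulating the step offsets. The sketch below assumes the recurrence $step_1 = 0$, $step_s = step_{s-1} + n - \frac{3}{4} 2^{2s-2}$ for $1 < s \le k/2$ with $k$ even, which is a reconstruction from the surrounding text rather than a quoted formula; the function names are hypothetical.

```python
def step_times(k):
    """step_s for s = 1..k/2 (k even), in petit cycles, using
    step_1 = 0 and step_s = step_{s-1} + n - (3/4) * 2^(2s-2)."""
    n = 2 ** k
    steps = [0]
    for s in range(2, k // 2 + 1):
        steps.append(steps[-1] + n - 3 * 2 ** (2 * s - 2) // 4)
    return steps


def total_time(k):
    """Start of the last step, plus the send_2 delay in that step (n/4),
    plus n petit cycles for the bus to drain."""
    n = 2 ** k
    return step_times(k)[-1] + n // 4 + n
```

Under these assumptions the result is $kn/2 + 1$ petit cycles for each even $k$ tried, one petit cycle above the $kn/2$ lower bound. For the 4-cube the second step starts at petit cycle 13 and the whole exchange takes 33 petit cycles.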


Finally, we note that interleaved message pipelining may
also be applied to binary tree routing patterns. From our
previous discussion we know that the parent-to-children
message routing in $E_{t1}$ has to be done in two bus cycles and
that the same message routing task can be performed using
a single bus cycle in $E_{t2}$. Communication efficiency in $E_{t2}$
can be further improved by using interleaved message pipelining
because during parent-to-children message transfer
only every other node is sending a message. Thus each parent
can send two messages to each child in one bus cycle.

4.  TWO-DIMENSIONAL  ARRAY  PROCESSORS 
WITH  PIPELINED  BUSES 

Linear optical buses have the disadvantage that message
transfer may incur $O(N)$ time delay in an $N$-processor system.
To reduce this delay to $O(\sqrt{N})$, we consider two-dimensional
APPBs. In a two-dimensional APPB, each node
is coupled to four buses as shown in Fig. 7a, where the two
horizontal buses are used for passing messages horizontally
in the same way as before, and the two vertical buses are
used for passing messages vertically in a similar way. For
convenience we diagram our two-dimensional APPB as in
Fig. 7b. Each node in a two-dimensional APPB of size
$N = m \times n$ will be given two identifications, one being a pair
of numbers $(x, y)$, $0 \le x < m$, $0 \le y < n$, indicating the
row-column position of the node in the two-dimensional
APPB, and the other being the row-major index, $i = xn + y$,
$0 \le i < N$, of the node. Corresponding to the bus cycle
defined for the linear case, in the two-dimensional APPB we
define $n\tau$ and $m\tau$ as a row bus cycle and a column bus cycle,
respectively, where $\tau$ is a petit cycle as defined previously.
When there is no confusion, e.g., while talking about message
transmissions in a row, we simply say a bus cycle instead of
a row bus cycle.

4.1. Message Routing in Two-Dimensional APPB

A unique issue that arises in the two-dimensional APPB
is the relay of messages. Specifically, suppose a message is
to be transferred from node $(x_1, y_1)$ to node $(x_2, y_2)$, with
$x_1 \ne x_2$ and $y_1 \ne y_2$. Then the message may first be sent from
$(x_1, y_1)$ to $(x_1, y_2)$, which is the node at the intersection of
row $x_1$ and column $y_2$, in the first bus cycle (a row bus cycle) and
then from $(x_1, y_2)$ to $(x_2, y_2)$ in the second bus cycle (a
column bus cycle). That is, the message has to be buffered
at node $(x_1, y_2)$ at the end of the first bus cycle and then
relayed to its destination in the second bus cycle. For the
purpose of relaying the message, we define a control function
$relay$ for node $(x_1, y_2)$ as

$$relay[(x_1, y_2)] = y_2 - y_1,$$

which indicates that node $(x_1, y_2)$ will read a message from
a row bus at time $|y_2 - y_1|$ (relative to the start of the row
bus cycle) and then write that message on the proper column
bus at the beginning of the following column bus cycle. If
$relay[(x_1, y_2)] = 0$, then no message is to be relayed by node
$(x_1, y_2)$. Clearly, in the worst case up to $n$ messages have to
be relayed and, therefore, $n$ relay buffers are needed at the
relaying node. Now we are ready to show how the four most
commonly used message routing patterns discussed in the
previous section can be realized in the two-dimensional
APPB.

One-to-One. The system executes a SEND[(x_1, y_1), (x_2, y_2)]
instruction, which requires that node $(x_1, y_1)$ send a
message to node $(x_2, y_2)$. We have $relay[(x_1, y_2)] = y_2 - y_1$
(in row bus cycle), and $wait[(x_2, y_2)] = x_2 - x_1$ (in column
bus cycle). This communication takes two bus cycles.

Broadcast. The system executes a BROADCAST[(x, y)]
instruction, which states that node $(x, y)$ broadcasts the same
message to all other nodes $(x_i, y_j)$. In a row bus cycle, $(x, y)$
broadcasts the message to nodes $(x, y_j)$, $y_j \ne y$. Then in
the following column bus cycle all $(x, y_j)$, including $(x, y)$,
broadcast the message in their corresponding columns. Thus
$relay[(x, y_j)] = y_j - y$ and $wait[(x_i, y_j)] = x_i - x$. This
communication also takes two bus cycles.

Semigroup Communication. This corresponds to the
execution of SEMIGROUP[(x, y)], which says that some
global information is to be computed and stored at node $(x, y)$.
This task can be accomplished using two linear semigroup
operations, one in rows and the other in a column. That is,
first we view each row as a linear APPB and do
SEMIGROUP(y) in all rows. Then in column $y$, we perform
SEMIGROUP(x). Thus $2 \log(n)$ bus cycles are needed for
this task.

Permutations. Let PERM[(x, y)] be an arbitrary permutation.
To avoid using $n$ relays at each node, we can use
a three-phase routing approach [24, 32] or equivalently a
three-bus-cycle approach in the two-dimensional APPB. In
this approach the first bus cycle is a "preprocessing" step
which distributes messages in each row such that the messages
going to the same row will occupy different columns. Then
the second and third bus cycles will route the messages to
their destination row and destination node, respectively. We
note that for arbitrary permutations this approach implies
the use of a centralized controller which would compute the
message destinations for the preprocessing step. This calculation
requires the construction of a bipartite graph and
its partitioning into complete matchings, which would dominate
the time complexity for the total task of computing
and implementing an arbitrary permutation. In applications
where a permutation can be precomputed, this time cost can
be amortized over many subsequent applications of the permutation.
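The two-cycle one-to-one transfer can be written out as a pair of control values, one for the relaying node and one for the destination. The following is a minimal sketch with hypothetical function names; it assumes the source and destination differ in both row and column, so that exactly one relay is involved.

```python
def row_major(x, y, n):
    """Row-major index of node (x, y) in an APPB with n columns."""
    return x * n + y


def one_to_one_2d(src, dst):
    """Control values for SEND[src, dst] with src = (x1, y1), dst = (x2, y2),
    assuming x1 != x2 and y1 != y2 so that one relay is needed."""
    (x1, y1), (x2, y2) = src, dst
    relay_node = (x1, y2)                    # intersection of row x1 and column y2
    return {
        "relay": (relay_node, y2 - y1),      # read during the row bus cycle
        "wait": (dst, x2 - x1),              # read during the column bus cycle
    }
```

For a transfer from (1, 2) to (4, 0), node (1, 0) relays with value -2 and node (4, 0) waits with value 3.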

4.2.  Embedding  Binary  Trees  in  Two-Dimensional  APPB 

As mentioned previously, arbitrary message routing and
permutations in the two-dimensional APPB may require $n$
relaying buffers in each node in the worst case. In this subsection
we present an embedding for a binary tree network in
which only one relay buffer is needed to route messages. An
embedding of an $L$-level complete binary tree into a
two-dimensional APPB with $n = 2^k$ columns may be obtained
by (i) mapping levels $0, \ldots, k - 1$ of the tree to row 0 of
the two-dimensional APPB and (ii) mapping level $l$,
$k \le l < L$, of the tree to the $2^{l-k}$ rows
$2^{l-k}, 2^{l-k} + 1, \ldots, 2^{l-k+1} - 1$ of the APPB such that the
two children of the same parent are mapped into two adjacent
rows in the same column as the parent. Specifically we define
our embedding of a binary tree network into the two-dimensional
APPB by a mapping $F(i) = (F_x(i), F_y(i))$, which maps each
node $i$, $1 \le i < 2^L$, in the tree to a node $(F_x(i), F_y(i))$ in the
two-dimensional APPB. Let $i$ be a node at level $l$, $0 \le l < L$, in
the binary tree. The mapping is defined by

$$F_x(i) = \begin{cases} 0, & 1 \le i < 2^k, \\ 2^{l-k} + i \bmod 2^{l-k}, & 2^k \le i < 2^L, \end{cases}$$

and

$$F_y(i) = \begin{cases} i, & 1 \le i < 2^k, \\ \left\lfloor \dfrac{i \bmod 2^l}{2^{l-k}} \right\rfloor, & 2^k \le i < 2^L. \end{cases}$$

As an example the embedding for the 4-level binary tree
in Fig. 8a is shown in Fig. 8b. Let us call this embedding
$E_{t3}$. $E_{t3}$ has the following properties: (i) parent nodes $i$,
$1 \le i < 2^{k-1}$, and their children are in row 0; (ii) parent nodes
$i$, $2^{k-1} \le i < 2^k$, which are in row 0, have their children in
row 1; and (iii) parent nodes $i$, $2^k \le i < 2^{L-1}$, and their
children are in the same column. Properties (i) and (ii) are
obvious. Here we prove only (iii). Since in the binary tree
each parent node $i$ has two children $2i + b$, $b = 0, 1$, to prove
(iii) we need only show that $F_y(i) = F_y(2i + b)$ for
$2^k \le i < 2^{L-1}$. For that, let $i$ be a parent node at level $l$, where
$k \le l < L - 1$ and $i = p2^l + q$ for some integers $p$ and $q$ such
that $0 \le q < 2^l$. Then

$$F_y(2i + b) = \left\lfloor \frac{(2(p2^l + q) + b) \bmod 2^{l+1}}{2^{l+1-k}} \right\rfloor = \left\lfloor \frac{(p2^{l+1} + 2q + b) \bmod 2^{l+1}}{2^{l+1-k}} \right\rfloor = \left\lfloor \frac{2q + b}{2^{l+1-k}} \right\rfloor = \left\lfloor \frac{q}{2^{l-k}} \right\rfloor = F_y(i).$$

FIG. 8. (a) A 4-level binary tree. (b) Its embedding in the two-dimensional APPB.

It is now clear that the relay function is not needed for
message transfer between parent nodes $i$ and their children
if $1 \le i < 2^{k-1}$ or $2^k \le i < 2^{L-1}$. However, such a relay is
needed if $2^{k-1} \le i < 2^k$. The wait and relay functions for
$E_{t3}$ are obtained in the following.

Let $wait_{c,b}[(x, y)]$, where $(x, y) = F(i)$, be the wait functions
for a parent node $i$ to receive a message from its left
and right child for $b = 0$ and 1, respectively. For the case
$1 \le i < 2^{k-1}$, the results for the linear APPB directly give
$wait_{c,b}[(x, y)] = -(y + b)$. For the case $2^k \le i < 2^{L-1}$, let
$i$ be at level $l$, $k \le l < L - 1$, and $i = p2^{l-k} + q$. Then

$$F_x(i) = 2^{l-k} + i \bmod 2^{l-k} = 2^{l-k} + q;$$

$$F_x(2i + b) = 2^{l+1-k} + (2i + b) \bmod 2^{l+1-k} = 2^{l+1-k} + (p2^{l+1-k} + 2q + b) \bmod 2^{l+1-k} = 2^{l+1-k} + 2q + b;$$

$$wait_{c,b}[(x, y)] = F_x(i) - F_x(2i + b) = (2^{l-k} + q) - (2^{l+1-k} + 2q + b) = -(x + b).$$

$wait_{p,b}(j)$ can be obtained by recalling that
$wait_{p,b}(j) = -wait_{c,b}(i)$, where $i$ is the parent of $j$.

For the case where $2^{k-1} \le i < 2^k$, a wait and a relay
function are needed. Let $relay_{c,b}[(0, y)]$, $0 \le y < 2^k$, be the
relay function of node $(0, y)$ for relaying the message from
a child node (again, $b = 0$ for the left child and $b = 1$ for the
right child) to its parent. Then we can show that

$$relay_{c,b}[(0, y)] = -1, \quad 0 \le y < 2^k,$$

$$wait_{c,b}[(0, y)] = 2^k - y - b, \quad 2^{k-1} \le y < 2^k.$$

Note that each node $(0, y)$ needs to relay only one child-to-parent
message, with the message from the left (right) child being
relayed by $(0, y)$ with $y$ even (odd), and that even though
node 0 is not a node in the tree, it helps relay messages. Also
note that $relay_{c,b}$ is applicable to column bus cycles. Now let
$relay_{p,b}[(0, y)]$, $y$ even (odd), be the relay function for node
$(0, y)$ to relay the message from a parent $(0, y')$ to its left
(right) child for $b = 0$ (1). Then $relay_{p,b}$ is easily obtained
from $relay_{p,b}[(0, y)] = -wait_{c,b}[(0, y')]$. And $wait_{p,b}$ is
determined as in the linear case.
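The mapping and its three properties can be verified mechanically for a small tree. The sketch below assumes heap numbering (root 1, children $2i$ and $2i + 1$) and the piecewise form of $F$ in which nodes below $2^k$ sit in row 0 and a node at level $l \ge k$ is placed at row $2^{l-k} + (i \bmod 2^{l-k})$ and column $\lfloor (i \bmod 2^l) / 2^{l-k} \rfloor$; the function name is hypothetical.

```python
def embed_t3(i, k):
    """(row, column) of tree node i (heap numbering, root = 1) under E_t3
    in an APPB with n = 2^k columns."""
    if i < 2 ** k:
        return (0, i)                       # levels 0..k-1 share row 0
    l = i.bit_length() - 1                  # level of node i, l >= k
    rows = 2 ** (l - k)                     # level l occupies this many rows
    return (rows + i % rows, (i % 2 ** l) // rows)
```

For the 4-level tree with $k = 2$, node 4 lands at (1, 0) and its children 8 and 9 at (2, 0) and (3, 0): same column, adjacent rows, so the parent's wait values are $-(x + b) = -1$ and $-2$.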

4.3. Network Embeddings Requiring No Relays

Embedding $E_{t3}$ still requires one message relay for communication
between two neighboring nodes in binary trees.
To further improve the communication efficiency, in this
subsection we show how to obtain embeddings of binary
trees as well as hypercubes such that no such message relay
is needed. Two approaches may be used to eliminate message
relays by intermediate nodes: a hardware approach and a
"software" approach. In the hardware approach, optical
switches are used at the intersections of row and column
buses to switch an optical signal, say, from a row bus to a
column bus, without requiring relay by an intermediate processor
[15]. In this paper we consider the "software" approach,
which relies on designing embeddings such that all
neighboring processors in a network are mapped into the
same row or column in the two-dimensional APPB. Thus,
no message relay is needed and no relay function is required.
This improves the communication efficiency significantly.
However, it has the disadvantage that nodes in the APPB
may not be fully utilized.

A basic measure that is usually used to evaluate the quality of an
embedding of a source graph $G_1 = \{V_1, E_1\}$ with a set of nodes
$V_1$ and a set of edges $E_1$ into a mesh architecture with a set of
nodes $V_2$ is the expansion cost, which is defined as the ratio of the
number of nodes in the target mesh to the number of nodes in the
embedded graph. Another measure useful for such evaluation is the
dilation cost. Specifically, the dilation of an edge $e \in E_1$, which
is mapped to a path $Q$ in the target mesh, is $|Q| - 1$, where $|Q|$
is the number of nodes on $Q$. However, the mesh model corresponding to
that of APPBs is different from those studied previously [1, 13, 34]
because the efficiency of the communication between any two nodes in
the same row or column in an APPB does not depend on the distance
between these two nodes. Therefore the criterion that is to be
satisfied by an embedding is different from previously studied
criteria. Specifically, it is desirable to obtain an embedding in which
any two neighboring nodes in the source graph are mapped into either
the same row or the same column in the two-dimensional APPB, thus
allowing them to communicate with each other using a single bus cycle.
An embedding which satisfies this requirement will be said to satisfy
the alignment condition. Note that $E_{t3}$ obtained in the previous
subsection does not satisfy the alignment condition and thus requires
message relays. That embedding, however, does have an optimal expansion
cost of $2^L/(2^L - 1)$. In contrast, the binary tree embedding
presented in the following satisfies the alignment condition, but its
expansion cost is not optimal. This demonstrates a trade-off between
the expansion cost and the dilation cost for network embeddings in the
two-dimensional APPB.
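
The two criteria above can be illustrated with a minimal sketch. It
assumes an embedding is given as a table from graph nodes to
(row, column) positions; the names `satisfies_alignment`,
`expansion_cost`, and the two example placements are hypothetical and
are introduced only for illustration.

```python
from typing import Dict, Iterable, Tuple

Coord = Tuple[int, int]  # (row, column) position in the two-dimensional APPB


def satisfies_alignment(edges: Iterable[Tuple[int, int]],
                        placement: Dict[int, Coord]) -> bool:
    """Return True if every edge joins two nodes in the same row or column."""
    for u, v in edges:
        (row_u, col_u), (row_v, col_v) = placement[u], placement[v]
        if row_u != row_v and col_u != col_v:
            return False  # this edge would need a message relay
    return True


def expansion_cost(rows: int, cols: int, graph_nodes: int) -> float:
    """Ratio of nodes in the target array to nodes in the embedded graph."""
    return rows * cols / graph_nodes


# Hypothetical example: a 2-level binary tree (root 0, children 1 and 2).
tree_edges = [(0, 1), (0, 2)]
in_one_row = {1: (0, 0), 0: (0, 1), 2: (0, 2)}     # 1 x 3 array, root in the middle
with_diagonal = {0: (0, 0), 1: (0, 1), 2: (1, 1)}  # 2 x 2 array, child 2 off-axis

print(satisfies_alignment(tree_edges, in_one_row))     # every edge shares a row
print(satisfies_alignment(tree_edges, with_diagonal))  # edge (0, 2) is diagonal
print(expansion_cost(1, 3, 3), expansion_cost(2, 2, 3))
```

In this toy case the aligned placement happens to have the smaller
expansion cost; for larger trees the two goals conflict, as discussed
above.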

Consider Fig. 9a and assume that we already have an embedding of an
s-level binary tree with $N_s = 2^s - 1$ nodes into a two-dimensional
APPB of size $a_s \times b_s$. The embedding is assumed to satisfy the
alignment condition. That is, all the neighboring nodes in the s-level
tree are mapped into the same row or column in the two-dimensional
APPB. Using this level-s embedding (starting level) as building blocks,
the embedding for an (s + 2)-level tree is obtained as shown in
Fig. 9b. Clearly in this embedding the neighboring nodes are again on
the same row or column. A still larger tree is obtained by repeating
this modular building procedure until the desired size is achieved. Let
us call this embedding $E_{t4}$.
Assuming that in $E_{t4}$ the embedding of an L-level tree,
$L = s + 2\sigma$, $\sigma = 0, 1, \ldots$, occupies an area, in number
of nodes, equal to $A_L$ in the two-dimensional APPB, we may
inductively prove that

$$A_L = 2^{L-s}\left[A_s + \left(1 - 2^{-(L-s)/2}\right) b_s\right].$$

FIG. 9. Modular embedding of binary trees, $E_{t4}$, in the
two-dimensional APPB. (a) A building block in which an s-level binary
tree is embedded. (b) Embedding of an (s + 2)-level binary tree.

With this result, the expansion cost for the embedding of an L-level
tree is

$$C_L = \frac{A_L}{2^L - 1}
      = \frac{2^{L-s}\left[A_s + \left(1 - 2^{-(L-s)/2}\right) b_s\right]}
             {2^{L-s}(N_s + 1) - 1}.$$

It can be checked that $C_L$ is monotonically increasing with L.
However, for large L, the value of $C_L$ asymptotically equals

$$C_{L,\max} = \frac{A_s + b_s}{N_s + 1}.$$

Note that, if $N_s \gg 1$ and $A_s \gg b_s$, the value of $C_{L,\max}$
simplifies to $C_{L,\max} \approx A_s/N_s = C_s \geq 1$. That is, the
expansion cost for the entire embedding is determined by the expansion
cost of the building block. Thus low expansion costs may be obtained if
the starting building block satisfies $N_s \gg 1$, $A_s \gg b_s$, and
$C_s \to 1$. Some examples of building blocks are shown in Fig. 10 with
their corresponding expansion cost $C_{L,\max}$. Note that in this
modular embedding scheme, as the embedding goes one level higher, the
number of levels of the tree increases by 2. Thus if s is even (odd)
then L is even (odd). Therefore according to whether the desired level
L of the tree is even or odd, the starting level s must be chosen
properly.
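
The growth of the expansion cost can be checked numerically with a
short sketch. It assumes that each modular step turns a block of size
a x b into one of size (2a + 1) x 2b, which is our reading of the
construction in Fig. 9b; the function names and the 1 x 3 starting
block for a 2-level tree are hypothetical.

```python
def block_dims(a_s: int, b_s: int, sigma: int) -> tuple:
    """Array dimensions after sigma modular steps (assumed recurrence)."""
    a, b = a_s, b_s
    for _ in range(sigma):
        a, b = 2 * a + 1, 2 * b  # assumption: one extra row per step
    return a, b


def expansion_cost(a_s: int, b_s: int, s: int, sigma: int) -> float:
    """Area of the embedding divided by the 2**L - 1 tree nodes, L = s + 2*sigma."""
    a, b = block_dims(a_s, b_s, sigma)
    levels = s + 2 * sigma
    return a * b / (2 ** levels - 1)


def limiting_cost(a_s: int, b_s: int, s: int) -> float:
    """Large-L limit (A_s + b_s) / (N_s + 1), with N_s + 1 = 2**s."""
    return (a_s * b_s + b_s) / 2 ** s


# Hypothetical starting block: a 2-level tree laid out in a 1 x 3 array.
for sigma in range(5):
    print(2 + 2 * sigma, round(expansion_cost(1, 3, 2, sigma), 4))
print("limit:", limiting_cost(1, 3, 2))
```

Under these assumptions the printed costs rise with the number of
levels and stay below the limiting value, in line with the monotonic
behavior stated above.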

To determine the control functions for $E_{t4}$, let $r_l$ be the root
at an embedding level $l$, $l = s + 2, s + 4, \ldots, L$, and
$(x_l, y_l)$ be the coordinate, i.e., the row-column position, of $r_l$
in the two-dimensional APPB. Then from Fig. 9b, the coordinate of $r_l$
is

$$(x_l, y_l) = (a_{l-2},\; b_{l-2} - 1),$$

where $a_l$ and $b_l$ can be found to be equal to
$2^{(l-s)/2}(a_s + 1) - 1$ and $2^{(l-s)/2} b_s$, respectively. Thus,

$$(x_l, y_l) = \left(2^{(l-s)/2 - 1}(a_s + 1) - 1,\;
                     2^{(l-s)/2 - 1} b_s - 1\right).$$
FIG. 10. Example building blocks for the modular embedding of binary
trees, $E_{t4}$, and their corresponding expansion costs $C_{L,\max}$:
(a) 1.5, (b) 1.5, (c) 1.75, (d) 1.31, (e) 1.22, (f) 1.17.


Now the control functions can be determined as follows. First, within
the building block determine the wait functions according to the
specific building block in use. Let $(x_s, y_s)$ be the coordinate of
$r_s$, the root in the building block. We then need only determine the
wait functions for the new nodes which appear as we go to a
higher-level embedding. For example, in Fig. 9b when we go from level s
to s + 2, the new nodes are the root $r_{s+2}$ and its two children. By
letting $wait_{c,b}(r_l)$ be the wait function for node $r_l$ to
receive a message from a child node, we have

=  .V/  - 

waitc.t-,(b)  =  -(,V/  -  +  1 ), 

naitc.„.Ju,}  =  waiter, .AVi)  =  -  .v,-;). 

where the coordinate $(x_l, y_l)$ is as determined previously. These
are the wait functions for the new parents to receive messages from
their children. The wait functions for the children to receive messages
from these new parents are obtained by recalling that
$wait_p = -wait_c$.

Next we show that the binary hypercube of $2^{2k}$ nodes can also be
embedded in a two-dimensional APPB of size $2^k \times 2^k$ such that
the alignment condition is satisfied. As in the case of binary trees,
the embedding is again modular with the basic module being the binary
2-cube shown in Fig. 11a. A 3-cube embedding is obtained by putting
together two such 2-cubes side-by-side as shown in Fig. 11b, and a
4-cube embedding is obtained by putting together two 3-cubes one above
the other as shown in Fig. 11c, and so on. Note that the nodes in
Fig. 11c correspond to the cube nodes of Fig. 5a. In this way the
embedding, denoted $E_{h2}$, of the binary hypercube of the desired
size is obtained modularly.

It is observed that in embedding $E_{h2}$ each row and column is itself
a binary k-cube. For example, if we take the column number y as the
node id for the nodes in any row x, then row x is a binary k-cube
consisting of nodes y, $0 \le y < 2^k$. Let us call each row or column
a subcube. Then we have $2^{k+1}$ such subcubes. For each subcube, if
we use the column id y (or the row id x) to identify its nodes, all the
control functions step, send, and wait are exactly the same as those
derived for the hypercube embedding in the linear APPB. Thus the total
communication time for emulating the hypercube can be minimized through
overlapped pipelining as presented in the previous section. It can be
seen that all the neighboring nodes in the hypercube are mapped to
either the same row or the same column in the two-dimensional APPB.
Therefore $E_{h2}$ satisfies the alignment condition and thus requires
no message relay for communications between neighboring nodes in the
hypercube. Finally, since the number of nodes used in the
two-dimensional APPB is equal to that of the hypercube, we achieve a
minimal expansion cost of unity.
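
One way to see why such an embedding is aligned is sketched below. The
sketch assumes that the 2k address bits of a hypercube node are split
into k row bits and k column bits; this particular bit assignment is
our assumption and need not match the layout of Fig. 11, but it has the
same properties (every row and column is a k-cube, and the expansion
cost is one). The function names are hypothetical.

```python
def embed_hypercube(k: int) -> dict:
    """Map node ids 0 .. 2**(2k) - 1 to (row, col) in a 2**k x 2**k array.

    Assumption: the high k bits select the row, the low k bits the column.
    """
    mask = (1 << k) - 1
    return {node: (node >> k, node & mask) for node in range(1 << (2 * k))}


def is_aligned(k: int, placement: dict) -> bool:
    """Check that every hypercube edge stays within one row or one column."""
    for node in range(1 << (2 * k)):
        for bit in range(2 * k):
            neighbor = node ^ (1 << bit)  # hypercube neighbors differ in one bit
            (r1, c1), (r2, c2) = placement[node], placement[neighbor]
            if r1 != r2 and c1 != c2:
                return False
    return True


placement = embed_hypercube(2)               # 4-cube on a 4 x 4 array
print(is_aligned(2, placement))              # alignment condition
print(len(set(placement.values())) / 16)     # expansion cost
```

Flipping a column bit changes only the column, and flipping a row bit
changes only the row, so no hypercube edge ever needs a relay.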

5. BANDWIDTH ANALYSIS

In  this  section,  we  evaluate  the  merit  of  the  pipelined 
communication  structure  by  comparing  it  with  linear  arrays 
which  utilize  nearest-neighbor  and  exclusive  access  bus  in¬ 
terconnections.  We  evaluate  the  different  models  irrespective 
of  the  technology  used  to  implement  them.  In  other  words, 
we  assume  that  the  transmission  rate  and  the  propagation 
delay  are  the  same  for  both  optical  and  electronic  commu¬ 
nication  links. 

Consider the linear array of n processors with nearest-neighbor
connections as shown in Fig. 1b and assume that the physical separation
between each pair of neighboring processors is D. Such an array may
emulate one cycle of a pipelined bus in a time $n(T_p + T_D)$, where
$T_D$ is the propagation time required for a signal to travel the
distance D and $T_p$ is the time required to process a message at the
sending and the receiving ends of a communication link. $T_p$ includes
synchronization, message generation, buffering, and routing. We note
that for the cases of interleaved and overlapped pipelining discussed
in Section 3.3, at most two messages might be processed in this time.
The bandwidth of the nearest-neighbor connected array, $B_a$, defined
as the maximum number of messages that may be transmitted per second,
is thus given by

$$B_a = \frac{n}{n(T_p + T_D)} = \frac{1}{T_D}\,\frac{1}{\rho + 1},$$

where $\rho = T_p/T_D$.
For the pipelined linear APPB, the optical distance, $D_o$, between two
consecutive processors should be larger than the message length (see
Eq. (1) in Section 2). In other words, if $D > bwc_g$, we set
$D_o = D$; otherwise $D_o$ should be made equal to $bwc_g$ (for
example, by coiling an optical fiber) so that each processor can inject
a message into the bus without collision. Thus, the signal propagation
time between two consecutive processors is $\max\{T_D, \alpha T_D\}$,
where $\alpha = (bwc_g)/D$. The pipelined bus cycle time is then
$T_p + nT_D\max\{1, \alpha\}$. Given that n messages may be transmitted
during a pipelined bus cycle, the bandwidth of the pipelined bus is

$$B_p = \frac{n}{T_p + nT_D\max\{1, \alpha\}},$$

and thus,

$$\frac{B_p}{B_a} = \frac{n(\rho + 1)}{\rho + n\max\{1, \alpha\}}. \qquad (3)$$
In Fig. 12a a parametric plot showing the relation between $B_p/B_a$
and $\rho$ is given in terms of n for $\alpha \le 1$ and $\alpha > 1$.
The curve for $\alpha \le 1$ corresponds to the case where the message
length is less than or equal to the physical separation between
processors, while the curve for $\alpha > 1$ reflects the case where
message length is longer than the physical separation between
processors, and thus the optical path has been extended to accommodate
the entire message. By taking the limit of Eq. (3) as
$\rho \to \infty$, it is clear that, for fixed $\alpha$ and large
$\rho$, the ratio $B_p/B_a$ approaches n. Also, when $\rho = 1$ and
$\alpha \le 1$, we obtain $B_p/B_a = 2n/(n + 1) \approx 2$. In Fig. 12b
we plot $B_p/B_a$ versus $\rho$ for a fixed-size array with n = 64 and
for several values of $\alpha$. These plots show that the pipelined bus
is more effective for larger values of $\rho$ and smaller values of
$\alpha$.

FIG. 12. The ratio, $B_p/B_a$, of the bandwidth of a pipelined bus to
that of a linear array with nearest-neighbor connections as a function
of $\rho$, $\alpha$, and n. (a) A parametric curve. (b) For a
fixed-size system with n = 64.
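
The behavior of this ratio can be explored with a small sketch that
evaluates $n(\rho + 1)/(\rho + n\max\{1, \alpha\})$ directly. The
function name `ratio_pipelined_to_array` and the sample parameter
values are hypothetical.

```python
def ratio_pipelined_to_array(n: int, rho: float, alpha: float) -> float:
    """B_p / B_a = n (rho + 1) / (rho + n max(1, alpha))."""
    return n * (rho + 1.0) / (rho + n * max(1.0, alpha))


n = 64
for alpha in (0.5, 1.0, 3.0, 10.0):
    row = [round(ratio_pipelined_to_array(n, rho, alpha), 2)
           for rho in (1.0, 10.0, 100.0, 1000.0)]
    print(f"alpha={alpha}: {row}")
```

For a fixed n, each printed row grows with $\rho$ toward n, and rows
with larger $\alpha$ lie below those with smaller $\alpha$.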

For multiprocessor interconnections, D is determined by placement and
routing within VLSI chips, by PC board connections, or by back-plane
interconnections. In all cases, D, and therefore $T_D$, is relatively
small. Given that $T_p$ is, at least, on the order of microseconds, the
ratio, $\rho$, of processing to communication times should be much
larger than 1 (on the order of 10-1000). Also, with current technology
it is reasonable to assume that $\alpha$ is relatively small (between 1
and 10). For example, for board-to-board communications
($D \approx 10$ cm), it is possible to drive an optical communication
line at the speed of 10 GHz. Assuming that the speed of light in
optical fibers is $c_g = 2 \times 10^8$ m/s, and that each message
contains b = 16 bits, we obtain $\alpha \approx 3$. The same value of
$\alpha$ is obtained if optical communications are implemented on GaAs
wafers at 100 GHz and a physical processor separation of 1 cm. Note
that the value of $\alpha$ may be reduced if parallel buses are used to
reduce b.
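
The numerical example can be reproduced with a few lines of code. The
sketch assumes that the pulse duration w equals one period of the
quoted line rate (w = 1/f), which the text does not state explicitly;
the function name `alpha` and its parameters are hypothetical.

```python
def alpha(bits: int, line_rate_hz: float, separation_m: float,
          c_g: float = 2.0e8) -> float:
    """alpha = (b * w * c_g) / D, with w assumed to be 1 / line rate."""
    pulse_duration_s = 1.0 / line_rate_hz      # assumption: one pulse per period
    message_length_m = bits * pulse_duration_s * c_g
    return message_length_m / separation_m


print(alpha(16, 10e9, 0.10))    # board-to-board: 10 GHz, D = 10 cm
print(alpha(16, 100e9, 0.01))   # GaAs wafer: 100 GHz, D = 1 cm
print(alpha(4, 10e9, 0.10))     # four parallel buses, 4 bits per bus
```

Under this assumption both configurations give a value slightly above
3, and splitting the 16 bits over four parallel buses brings $\alpha$
below 1.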

Next we compare the bandwidth of a pipelined bus with that of an
exclusive access bus. Given that the bandwidth of an exclusive access
bus is $B_e = 1/(T_p + nT_D)$, we have

$$\frac{B_p}{B_e} = \frac{n(\rho + n)}{\rho + n\max\{1, \alpha\}}.$$

This shows that as $\alpha$ approaches 1, the pipelined bus can
accommodate n messages in the same cycle time as the exclusive access
bus. For larger $\alpha$, the pipelined bus cycle will be stretched to
accommodate the length of the messages, and thus, the performance gain
due to pipelining will be less than n.
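
As a quick check of this comparison, the following sketch evaluates
$n(\rho + n)/(\rho + n\max\{1, \alpha\})$ for a few values of
$\alpha$. The function name `ratio_pipelined_to_exclusive` and the
sample values are hypothetical.

```python
def ratio_pipelined_to_exclusive(n: int, rho: float, alpha: float) -> float:
    """B_p / B_e = n (rho + n) / (rho + n max(1, alpha))."""
    return n * (rho + n) / (rho + n * max(1.0, alpha))


n, rho = 64, 100.0
for a in (1.0, 2.0, 3.0, 10.0):
    print(a, round(ratio_pipelined_to_exclusive(n, rho, a), 2))
```

With $\alpha = 1$ the gain equals n, and it drops as $\alpha$ grows,
since the cycle must be stretched to fit the longer messages.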

The above analysis is independent of the media used for communication.
If optical pipelined buses are to be compared with electronic buses,
then the physical constraints on the electronic propagation speed
should be taken into account. Specifically, the effect of capacitive
loading and mutual inductance on the signal propagation speed (the
transmission line effect) should be considered. Thus, message
pipelining using electro-optical technology offers a potential for
substantially enhancing bandwidth utilization. Further, pipelining
techniques will be of increasing effectiveness because this technology
offers the capability of generating very short pulses [12, 33], thus
reducing w and decreasing $\alpha$.




6.  CONCLUDING  REMARKS 

We  have  presented  efficient  communication  architectures 
which  exploit  the  optical  signal’s  properties  of  unidirectional 
propagation  and  predictable  path  delays  in  order  to  pipeline 
messages  on  optical  buses.  As  shown  in  Section  5,  the  pipe¬ 
lined  model  has  its  merits  irrespective  of  the  technology  in 
which  it  is  implemented.  Although  the  presentation  in  this 
paper  is  based  on  an  optical  model  in  which  delays  inherent 
in  optical  fibers  serve  as  slots  for  space  multiplexing,  it  is 
possible  to  use  shift  registers  as  buffer  memories  for  these 
slots  [36].  Thus  pipelined  buses  may  be  implemented  in 
either optics or electronics. However, for the electronic
implementation, the signal propagation delay, $T_D$, will depend on the
speed of the shift registers, resulting in a relatively small value for
the ratio of processing to communication times, $\rho$.

We proposed efficient approaches to fundamental message routings
including one-to-one, broadcast, semigroup communications, and
permutations for the APPB architectures. Such efficient accomplishment
of these commonly used message routing patterns can significantly
improve the efficiency of many parallel algorithms. We presented here
efficient embeddings of the binary trees and hypercube networks.
Embeddings for other well-known interconnection networks, including
pyramids, shuffle-exchange networks, X-binary-trees [9], and
X-quad-trees, have also been obtained [14, 16]. Such efficient
embeddings of these well-known communication structures allow all
algorithms designed for these structures to be efficiently executed on
the APPB architectures. They also allow an APPB to be logically
reconfigured as an architecture which is more suitable for a given
computation task.

We  have  not  considered  in  this  paper  several  issues  that 
are  relevant  to  the  implementation  of  the  proposed  archi¬ 
tectures.  Such  issues  include  the  synchronization  of  the  pro¬ 
cessors  to  the  accuracy  implied  by  the  speed  of  optics,  tem¬ 
poral  pulse  positioning,  optical  fanout,  and  the  distribution 
of optical power in a way that allows the detector at each processor to
detect the optical signals correctly. These issues must be addressed
with regard to the reliability, scale, and device technology which is
appropriate for computing applications. Some of these issues have been
presented in [7, 25, 31].

In our experimental work [6, 8, 21] we are investigating the practical
limits to these technological concerns. We have shown that three
factors, threshold power margin, synchronization error, and coupling
ratio, determine the system scale. On the basis of current and
near-term technology, our experiments show that synchronization error
does not contribute significantly to the bounds of system size. Rather,
power distribution effects dominate. Preliminary investigations show
that by using off-the-shelf optical components we can currently build
linear buses operating at 300 MHz and containing about 100 processors.
Using more sophisticated electro-optics (gallium arsenide, custom
couplers, and dual level bus structures) we believe that 10-GHz buses
of over 400 processors are feasible. Further, we believe that near-term
technologies such as fiber amplifiers as well as alternate bus
structures will alleviate the power distribution problem.
Abstract

An optical communication structure for multiprocessor arrays that
exploits the high communication bandwidth of optical waveguides is
proposed. The structure takes advantage of two properties of optical
signal transmissions on waveguides, namely, unidirectional propagation
and predictable propagation delays per unit length. Two novel
time-division multiplexing approaches are proposed for non-SIMD
environments to obtain a communication bandwidth comparable to that of
message pipelining in SIMD environments. Analysis and simulation
results are given to evaluate the communication effectiveness of the
system. A clock distribution method is also proposed to address
potential synchronization problems. Finally, feasibility issues with
current and future technologies are discussed.

1. Introduction and background

Optical interconnections for communication networks and multiprocessor
systems, including both free-space and guided wave [7,8,9,21,22]
approaches, have been studied extensively in the literature. In this
paper, we propose a waveguide interconnection system with time-division
communications.

Time-division communications are especially useful in optically
interconnected systems where high communication bandwidth can be
exploited. Optical pulse transmissions on a waveguide have two
properties that distinguish them from electronic signal transmissions,
namely unidirectional propagation and predictable propagation delays
per unit length. In a multiprocessor system connected with an optical
waveguide (or bus), relationships between the spatial and temporal
positions of transmitted pulses can be established. For example, if two
processors transmit a pulse on the waveguide at the same time, the
difference between arrival times of these two pulses at any checkpoint
downstream is equal to the propagation delay between the two
processors. In other words, the spatial separation of the two
processors determines the temporal separation between the pulses they
transmit. By arranging spatial separations and controlling transmission
(or receiving) times of processors, time-division multiplexing (or
demultiplexing) can be done without using multiplexers (or
demultiplexers).

Several time-division switching approaches can be applied in a
multiprocessor system connected by optical buses. In the first
approach, each processor is assigned a fixed time slot and transmits or
receives a message during that particular time slot. A sequence of time
slots formed on the transmitting segment of a bus is rearranged via a
time-slot interchanger [21,24] and then forwarded to the receiving
segment. Each time slot of the output sequence contains a message
destined to the processor corresponding to that slot. In the second
approach, each processor is assigned a fixed transmitting time slot. A
sequence of time slots formed on the transmitting segment is directly
forwarded to the receiving segment without interchanging time slots.
Instead of assigning a fixed receiving time slot to each processor, an
SIMD environment is assumed where each processor knows which processor
is sending a message to it and knows the time slot that contains the
message. Since there is a one-to-one mapping between a source processor
and a time slot, we call this approach time-division source-oriented
multiplexing (or TDSM). It has also been referred to as bus pipelining
in [7, 16].

TDSM may also be applied in a non-SIMD environment, where source
processors are not known to the destination processors. In this case
each message should contain address information so that each processor
will be able to receive messages upon address decoding. Another
approach, which is also applicable in a non-SIMD environment, assigns a
fixed receiving time slot to each processor. Each message is
transmitted during the receiving time slot assigned to its destination
processor. Since there is a one-to-one mapping between a destination
processor and a time slot, we call this approach time-division
destination-oriented multiplexing (or TDDM). Since each destination
only has one dedicated receiving time slot, contentions can occur if
several sources want to send messages to the same destination. One way
to ensure exclusive access of a time slot is to use a reservation
scheme.

In this paper, we will discuss TDSM and TDDM approaches and apply the
combination of these two in our system design. We use coincident pulse
techniques [3,13,20] to encode address information that is contained in
messages. In order to explain coincident pulse addressing, we consider
an optical bus connected linear array of N processors, as shown in
Figure 1.


Figure 1. A linear optical array

Each processor transmits on the upper half segment of a bus and
receives from the lower half segment. The optical bus consists of three
waveguides, one for carrying messages (the message waveguide) and two
for carrying address information (the reference waveguide and the
select waveguide). Messages are organized as message frames, which have
a certain fixed length. The propagation delay on the reference
waveguide is the same as that on the message waveguide but not the same
as that on the select waveguide. A fixed amount of additional delay,
which we show as loops in Figure 2, is inserted onto the reference
waveguide and the message waveguide.
Let w be the pulse duration in seconds, and let $c_g$ be the velocity
of light in the waveguides. Define a unit time to be the spatial length
of a single optical pulse, that is $w \times c_g$. Starting with the
fact that all three waveguides have equal intrinsic propagation delays,
we add one time unit delay between any two adjacent processors on the
receiving (lower half) segments of the reference waveguide and the
message waveguide as shown in Figure 2(a). Since there are no changes
on the transmitting (upper half) segments of any waveguide and since
the message waveguide has exactly the same length as the reference
waveguide, Figure 2(a) shows only the receiving segments of the select
waveguide and the reference waveguide.

A source processor sends a reference pulse and a select pulse at
appropriate times, so that after these two pulses propagate through
their corresponding waveguides, a coincidence of the two occurs at the
desired destination. The source processor also sends a message frame
which propagates synchronously with the reference pulse. Whenever a
processor detects a coincidence of a reference pulse and a select
pulse, it reads the message frame. More specifically, let $t_{ref}$ be
the time when processor i transmits its reference pulse and
$t_{sel}(j)$ be the time when it transmits a select pulse. These two
pulses will coincide at processor j if and only if

$$t_{sel}(j) = t_{ref} + j \qquad (1.1)$$

where $0 \le i, j < N$. This means that for a given reference pulse
transmitted at time $t_{ref}$, the presence of a select pulse at time
$t_{ref} + j$ will address processor j while the absence of a select
pulse at that time will not. In essence, the address of a destination
processor is unary encoded by the source processor using the relative
transmission time of a reference pulse and a select pulse.

Call the duration of each pulse, w, a pulse slot. A sequence of pulse
slots on the select waveguide, each with either the presence or the
absence of a select pulse relative to a given reference pulse, is
called an address frame. Figure 2(b) shows a snapshot of a reference
pulse and an address frame just after they have been transmitted. At
the transmission time, the select pulse is j units behind the reference
pulse. Since the reference pulse will be delayed by one unit each time
it propagates through a processor on the receiving segment of the
reference waveguide, these two pulses will coincide at processor j.

With the above unary addressing, an address frame contains N pulse
slots and is essentially a time-multiplexed sequence of pulse slots,
each corresponding to a destination processor. As the address frame
propagates through the receiving segment of the bus, demultiplexing of
each pulse slot is performed with respect to a reference pulse via unit
delays added on the reference waveguide. Address decoding, which could
be a bottleneck with traditional addressing mechanisms, is done through
the detection of a coincidence at the destination.
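
The unary addressing scheme can be illustrated with a minimal sketch in
discrete time units. It assumes that the intrinsic propagation delay,
common to all waveguides, cancels out, so that only the unit delays on
the reference waveguide matter, and that the reference pulse has
accumulated k unit delays when it reaches processor k. The function
names `make_address_frame` and `addressed_processors` are hypothetical.

```python
from typing import Iterable, List


def make_address_frame(n: int, destinations: Iterable[int]) -> List[int]:
    """Unary address frame: slot j holds a select pulse iff j is a destination."""
    frame = [0] * n
    for j in destinations:
        frame[j] = 1
    return frame


def addressed_processors(t_ref: int, frame: List[int]) -> List[int]:
    """Processors that detect a coincidence of reference and select pulses."""
    # Select pulse for slot j is transmitted at t_ref + j, as in Eq. (1.1).
    select_times = {t_ref + j for j, bit in enumerate(frame) if bit}
    hits = []
    for k in range(len(frame)):
        ref_arrival = t_ref + k  # assumed: k unit delays before processor k
        if ref_arrival in select_times:
            hits.append(k)
    return hits


frame = make_address_frame(8, [0, 3, 7])   # multicast to three processors
print(frame)
print(addressed_processors(5, frame))
```

Setting several slots in the same address frame addresses several
destinations with a single reference pulse.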

Define a packet to be a collection of information including a message
frame, an address frame and a reference pulse. Let P be the length of a
packet in time units and D be the spatial separation of any two
adjacent processors on an optical bus. Because an address frame length
is N time units, we have the condition $P \ge N$. With TDSM
communications, if all processors transmit packets simultaneously, then
the condition $D \ge P$ has to be satisfied in order to prevent packet
overlappings. This condition, together with the condition that
$P \ge N$, limits the system size N. In addition, the address frame
length could be longer than the message frame length if N is large,
resulting in inefficiency. Another factor that limits the system size
relates to the power distribution. Specifically, the system size is
limited by the minimum power that can be detected at the last processor
in a linear system [4].

Because of these shortcomings in a linear system, we propose a
two-dimensional $n \times n$ array called ASOS, for Array structure
with Synchronous Optical Switches. In Section 2, we present the system
configuration of the proposed ASOS. We also show how communications in
the ASOS are done with TDDM, TDSM and the combination of the two. In
Section 3, we give analysis and simulation results that relate to
communication effectiveness of the system. In Section 4, we address the
potential synchronization problem and propose a global clock
distribution model as it relates to the packet size limitation. And
finally in Section 5, we determine the merits of the design and
conclude the paper.
Figure 3. System configuration of the ASOS

Figure 4. Connections of a switch


2.  Array  structure  with  synchronous  optical  switches 

2.1. System configuration

The system configuration of an Array structure with Synchronous Optical
Switches (or ASOS) is shown in Figure 3. Processors in the ASOS are
connected with a set of folded horizontal (row) buses and vertical
(column) buses, each consisting of three waveguides as in a linear
system. All row buses are assumed to be identical, and so are all
column buses. During the course of the following presentation, rows are
numbered from top to bottom, and columns and processors are numbered
from left to right.


A processor can transmit on the upper segment of a row bus while
receiving from the lower segment of a row bus and the right segment of
a column bus concurrently. As shown in Figure 3, an optical switch is
placed at each intersection of the lower segment of a row bus and the
left segment of a column bus. The switch connects a row and a column
bus as shown in Figure 4. Each switch is a 2 × 2 optical device which
can be in one of two states, straight or cross, as shown in Figure 4.
If a switch is in the straight state, a message packet propagating on a
row bus will continue propagating on it. However, if a switch is in the
cross state, a message packet propagating on a row bus will be switched
over to a column bus. Several kinds of optical switches can be used for
this purpose [1,2], and switch control is straightforward in ASOS, as
will be discussed in the next section. A global clock is used for
synchronization purposes and it is assumed that all processors and
switches receive identical copies of the synchronization clock. How
synchronization can be retained without the above assumption is
discussed in Section 4. Note that synchronizing the communication
structure does not mean that the processors have to execute in a
synchronized mode. Each processor may execute at its own pace and
submit messages to the communication structure independently. The
delivery of messages, however, is controlled by a synchronized
structure.

2.2.  Row  and  column  communications 

Communication between processors in the same
row, which we call row communication, uses each row
bus for transmitting and receiving messages with
switches set to the straight state. Since packets cannot
be transmitted directly on column buses, communication
between processors in different rows, which we call
column communication, uses both row and column buses
with the switches connecting them set to the cross state.
Row communication and column communication
alternate, in what we call row phases and column phases
respectively. Switches alternate their two states
accordingly. All switches are set to the same state
simultaneously.

Let the optical path length between any two
adjacent processors (or two adjacent switches) on a row bus
be D units long. In order to allow simultaneous
transmissions or switchings of packets without
overlappings, we require D > P. We further let the folded
optical path length on a row bus be D units (see Figure 5).
Let T = (2n - 1)D be the end-to-end propagation
delay of a row bus. Similarly, let the optical path length
between two adjacent processors (or two adjacent
switches) on a column bus be D units long and the
end-to-end propagation delay of a column bus be T units.


Figure 5. Originating an imagined train

Time-division multiplexing in the ASOS can be
best explained using the train model described
below. Each time-slot is P units long and is called a
packet slot (PS). Imagine that a train of n packet slots is
originated on a row bus as shown in Figure 5. When the
beginning of the train is at processor n, a
communication phase begins. Packet slots in a train are numbered
as n, n-1, ..., 1 from left to right. Two adjacent packet
slots in a train are separated by D units. That is, there is
a gap of D - P units between two slots. We call a train
originated at the beginning of a row phase or a column
phase a row train or a column train respectively. Note
that column buses, switches and taps are omitted from
the figure. Further, a row or a column train should be
regarded as three separate trains on the message, select
and reference waveguides respectively. Nevertheless,
when there is no confusion, we will view these three
trains as one on either a row or a column bus.

Assume that a train is originated at the beginning
of a communication phase (at time t_r). Let P_Arr(i, p)
be the time that PS_i arrives at processor p on the
upper segment of the bus. We have

P_Arr(i, p) = t_r + (n - i)D + (n - p)D
            = t_r + T - (i + p - 1)D          (2.1)
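
As an illustrative check of equation (2.1), the following sketch evaluates both forms of P_Arr. The function names and any sample values are assumptions made only for this example; T is taken to be (2n - 1)D, the value for which the two forms of (2.1) agree.

```python
def p_arr(i, p, n, D, t_r=0):
    """Time at which packet slot PS_i reaches processor p on the
    upper segment of a row bus, first form of equation (2.1)."""
    return t_r + (n - i) * D + (n - p) * D


def p_arr_via_T(i, p, n, D, t_r=0):
    """Second form of equation (2.1), written with the end-to-end
    delay T = (2n - 1)D (assumed value of T, see lead-in)."""
    T = (2 * n - 1) * D
    return t_r + T - (i + p - 1) * D


def row_phase_load_times(n, D, t_r=0):
    """In a row phase processor i loads its own slot PS_i, so it
    transmits at P_Arr(i, i). Returns {processor: time}."""
    return {i: p_arr(i, i, n, D, t_r) for i in range(1, n + 1)}
```

With n = 4 and D = 10, `row_phase_load_times(4, 10)` gives time 0 for processor 4 and time 60 for processor 1, reflecting that PS_n leads the train.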

TDSM is used in a row phase. Each PS_i in a
row train is assigned as a transmitting slot to processor i.
Each processor can send out one packet during each row
phase and each packet contains unary coincident pulse
addressing information. More specifically, let t_r be the
time when a row phase begins; processor i transmits a
packet, if it has one, at time P_Arr(i, i). That is, as the
train propagates through the upper segment of a row bus,
processor i loads PS_i with its packet.

One advantage of using TDSM is that a processor
can receive more than one message frame in a single
row phase, a capability which is usually referred to as
m-to-1 communication. Another advantage of
TDSM is that a processor can send out a message frame
to several destinations in a single row phase efficiently, a
capability which is usually referred to as multicasting.
This is done by multiplexing several select pulses, each
corresponding to one destination, in an address frame
II.T20I, 

Since a train is nD units long, the last packet slot
of the train, namely PS_1, leaves processor n on the
upper segment at the time t_r + nD. At that time, the row
phase ends and a column phase begins with a column
train originated. In a column phase, TDDM is used.
Each PS_i of a column train is assigned as a receiving
slot to the i-th switch on a row bus. That is, a packet
transmitted during PS_i is to be switched to column bus i.
If more than one processor wants to send packets to the
same column bus, packet slot contention will occur.
Such contention is resolved by using packet slot
reservation schemes, as will be discussed later. For now,
assume that reservations of all packet slots have been
made and therefore only one processor will transmit a
packet during any specific packet slot. Note however,
that a processor could send several packets, each to a
different column bus, if it has reserved the
corresponding packet slots.

During a column phase, each processor loads
packet slots that it has reserved while the train
propagates through the upper segment of a row bus. More
specifically, if processor p has reserved PS_i, it will
transmit a packet which is to be switched to column
bus i at time P_Arr(i, p). T units after a column phase
begins, each packet slot of a column train will arrive at
its corresponding switch simultaneously, as shown in
Figure 6. Every switch is set to the cross state for P
units to switch a packet over to a column bus, with
which the destination processor of the packet is
connected. That is, if a column phase begins at time
t_c = t_r + nD, then all switches will be in the cross state
during the time period from t_c + T to t_c + T + P. Note
that we have assumed a negligible switching time. In
reality, the switching time ranges from a few hundred
picoseconds to a few nanoseconds. If S is the time
needed for a switch to change from one state to the
other, it is sufficient to let D = P + S and let switches
start switching to the cross state at the time t_c + T - S.
Each switch will stay in the cross state for only P units
and start switching to the straight state at the time
t_c + T + P. Note that, in a non-SIMD environment,
TDDM has the advantage of timely and orderly
delivery of messages to passive destinations, such as
switches, without address information.
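
The timing argument above can be sketched as follows. This is a minimal model under stated assumptions, not the authors' implementation: switch i is assumed to sit (i - 1)D units past the fold on the lower segment, the fold itself is taken to be D units long, and the helper names are hypothetical.

```python
def slot_arrival_at_switch(i, n, D, t_c=0):
    """Time at which the start of PS_i of a column train reaches
    switch i on the lower segment of a row bus."""
    at_processor_n = t_c + (n - i) * D  # PS_n leads; PS_i follows (n - i)D later
    upper_segment = (n - 1) * D         # processor n to processor 1
    fold = D                            # folded path joining the two segments
    lower_segment = (i - 1) * D         # assumed distance from fold to switch i
    return at_processor_n + upper_segment + fold + lower_segment


def cross_window(n, D, P, S=0, t_c=0):
    """Return (time to start switching to cross,
               time to start switching back to straight)
    for a column phase that begins at t_c, with switching time S."""
    T = (2 * n - 1) * D
    return (t_c + T - S, t_c + T + P)
```

Under these assumptions every slot reaches its own switch exactly T = (2n - 1)D units after the column phase begins, which is why a single, global switch setting suffices.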


Figure 6. Simultaneous switchings of packet slots

Before a column train is switched, a new row
phase begins. A column phase ends as soon as the last
packet slot of a column train, namely PS_1, leaves
processor n on the upper segment. That is, a column phase
also takes nD units of time. A row train will be originated
and will propagate on the upper segment while a column
train is propagating on the lower segment. Note that the
beginning of PS_n of the row train is also D - P units
away from the end of PS_1 of the preceding column train.
Therefore, even with non-negligible switching time S,
all switches can be set back to the straight state while the
row train propagates on the lower segment of a bus.

When switches on a row bus are set to the cross
state, switches on other row buses are also set to the cross
state simultaneously. Since two adjacent switches on a
row bus are D units apart, packets switched from
different row buses will not overlap with each other on a
column bus. Rather, these packets will form a train on
the left segment of a column bus, as shown in Figure 7.


Figure 7. A packet train formed on a column bus

After a packet is switched, it will leave the left
segment and propagate on the right segment of a column
bus by the time switches are set to the cross state again.
That is, no packets on any column bus will be switched
back to a row bus. Every packet contains an address
frame encoded using coincident pulse techniques and a
coincidence will occur at the destination processor, as
will be described in the next subsection. In essence,
TDDM is used to switch a packet onto a proper column
bus and the destination processor on that column bus is
then addressed using the coincident pulse techniques.

2.3.  Coincident  pulse  addressing 

As mentioned in the previous section, coincident
pulse addressing is used in both row and column
communications. Each row bus is similar to a bus used in a
linear  system  refci  to  Figure  I  and  Fii:wc  Each 


column bus can be viewed as a row bus rotated 90
degrees anticlockwise. That is, on the right (receiving)
segment of a column bus, one unit delay is added
between any two adjacent processors on the message
waveguide and the reference waveguide.

A packet in a row train propagates only on a row
bus, that is, its reference pulse will encounter added delays
only on the row bus. To cause a coincidence of the
reference pulse and a select pulse at processor j of that
row bus, the relative transmission times of the two
pulses should satisfy equation (1.1). A packet in a
column train, however, propagates on a row bus and a
column bus. A reference pulse of such a packet will
encounter additional unit delays on the right half
segment of that column bus before arriving at a destination
processor. More specifically, if a packet in a column
train is destined to a processor at row i and column j, its
reference pulse will first encounter j added unit delays on
a row bus and then n - i added unit delays on a column
bus. Therefore, to cause a coincidence of the reference
pulse and a select pulse at that destination, the relative
transmission times of these two pulses should satisfy the
following equation:

t_sel = t_ref + j + (n - i)          (2.2)

Since we have 1 ≤ j + (n - i) ≤ 2n - 1, an address
frame should be 2n - 1 units long and
therefore D > P ≥ 2n - 1. Note that, based on the row
and column numbers of destination processors, each
processor can select appropriate packet slots to send packets
with the address information, and routing is
accomplished through propagation and possible switching of
packet slot trains.
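
To make the address-frame bound concrete, the sketch below enumerates the select-pulse offsets for every destination of an n × n array. It assumes equation (2.2) is read as an offset of j + (n - i) time units between the reference pulse and the select pulse; the function names are hypothetical.

```python
def select_offset(i, j, n):
    """Offset (in time units) between the reference pulse and the
    select pulse for a destination at row i, column j, assuming
    equation (2.2) is read as t_sel - t_ref = j + (n - i)."""
    return j + (n - i)


def offset_range(n):
    """Smallest and largest offsets over all destinations (i, j)
    with 1 <= i, j <= n."""
    offsets = [select_offset(i, j, n)
               for i in range(1, n + 1)
               for j in range(1, n + 1)]
    return min(offsets), max(offsets)


def multicast_offsets(destinations, n):
    """Sorted select-pulse offsets for a multicast packet, one pulse
    per distinct offset. `destinations` is an iterable of (i, j)."""
    return sorted({select_offset(i, j, n) for i, j in destinations})
```

For an 8 × 8 array, `offset_range(8)` returns `(1, 15)`, so an address frame of 2n - 1 = 15 units covers every destination.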

2.4. Packet slot reservation scheme

As mentioned earlier, when loading a column train
using TDDM, packet slot reservations are required to
resolve contention. Reservations can be made
concurrently with message transmissions using separate
folded waveguides, which we call reservation
waveguides, one at each row, as in Figure 8.

Figure 8. A reservation waveguide

There are n packet slots to be reserved for each
column train before the train is loaded in a column
phase. To reserve packet slots for a column train, a
corresponding train, which we call a reservation train,
is originated on a reservation waveguide. The
reservation train also consists of n reservation slots with the
same separation between two slots as in a column train.
Each reservation slot in a reservation train is used
to arbitrate the reservation of the corresponding packet
slot in a column train. The time period from the
origination of a reservation train to the end of reservation
operations on its n reservation slots is called a
reservation cycle. A reservation cycle can overlap with a row
phase and a column phase as long as the reservation
results are available in the corresponding column
phase.

Three reservation schemes have been studied. The
simplest reservation scheme is the linear priority scheme,
in which processors upstream are given higher priorities
than processors downstream. When competing for a
reservation of a packet slot, the processor that has the
highest priority among all competing processors will
succeed while the others will fail. For example, in Figure 8,
processor n is given the highest priority and processor 1
the lowest priority. After a reservation cycle begins,
each processor monitors the reservation slots of a
reservation train propagating on the upper segment of the
reservation waveguide. If processor p wants to reserve
packet slot i, it transmits pulses to the left while also
detecting pulses coming from the right during
reservation slot i. If no pulses are detected in that period,
processor p has succeeded in making a reservation for PS_i
since no processors upstream have attempted to reserve
it. Otherwise, processor p fails and a processor upstream
has succeeded in reserving PS_i. Reservation operations
will finish as soon as the train leaves the upper segment.
Assuming that a reservation slot equals a packet slot,
a reservation cycle will take T units.
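
The outcome of one reservation cycle under the linear priority scheme can be sketched as below. This models only the arbitration result, not the optical pulse detection; the function name and the dictionary-based interface are assumptions for illustration.

```python
def linear_priority(requests):
    """Resolve one reservation cycle under linear priority.

    requests: dict mapping processor number -> set of packet slots
              that processor wants to reserve.
    Returns:  dict mapping packet slot -> winning processor.

    Higher-numbered processors are upstream and therefore win,
    as with processor n in Figure 8.
    """
    winners = {}
    for proc in sorted(requests, reverse=True):  # upstream first
        for slot in requests[proc]:
            # A slot already claimed upstream stays with that processor.
            winners.setdefault(slot, proc)
    return winners
```

For example, if processors 1, 2 and 3 request slots {2, 3}, {1, 3} and {2} respectively, processor 3 obtains slot 2, processor 2 obtains slots 1 and 3, and processor 1 obtains nothing.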

The problem with the simple linear priority scheme
is the possibility of starvation. Processors downstream
with lower priorities may be indefinitely blocked
because some higher priority processors at the right keep
making reservations. In the restrained linear priority
scheme, linear priorities among processors are still
enforced but we require a processor that succeeded in
reserving a packet slot in a previous cycle to refrain
from making another reservation for the same packet
slot until after an idle cycle in which no processors
reserve that slot. Therefore, any processor would be able
to succeed in making a reservation for a packet slot
within n cycles, and no starvation is possible.
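
A minimal sketch of the restrained linear priority rule for a single packet slot follows. The class name and the set-based bookkeeping are assumptions made for illustration; the sketch tracks which processors must refrain until the next idle cycle.

```python
class RestrainedLinearPriority:
    """Arbiter for one packet slot under restrained linear priority."""

    def __init__(self):
        # Processors that have won since the last idle cycle and
        # must therefore refrain from reserving this slot again.
        self.refraining = set()

    def cycle(self, requesters):
        """Run one reservation cycle.

        requesters: set of processor numbers that want the slot.
        Returns the winning processor, or None for an idle cycle.
        """
        contenders = set(requesters) - self.refraining
        if not contenders:
            # Idle cycle: nobody reserves the slot, restraints are lifted.
            self.refraining.clear()
            return None
        winner = max(contenders)  # most upstream contender wins
        self.refraining.add(winner)
        return winner
```

With processors 1, 2 and 3 requesting the slot in every cycle, successive calls return 3, 2, 1, then `None` for the idle cycle, then 3 again. The idle cycle is the overhead this scheme pays for freedom from starvation.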

Nevertheless, using this scheme, a higher priority
processor still has more chances of success than lower
priority processors when competing for reservations.
The third scheme fully employs round-robin (or cyclic
polling  (Ml  I  A  processor  currcniJy  with  ihe  highest 
priority for a packet slot among all competing processors
will succeed in making a reservation. In the next
reservation cycle, it will become the lowest priority one when
competing for the same packet slot. Meanwhile, the
processor with the priority next to the one that succeeded in
this last cycle will have the highest priority and all other
processors will adjust their priorities accordingly in a cyclic
order.
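
The round-robin rule can be sketched for a single packet slot as follows. The class name, the choice of processor n as the initial highest-priority processor, and the descending cyclic order are assumptions made to match the linear priority ordering; the source does not fix these details.

```python
class RoundRobinArbiter:
    """Round-robin (cyclic polling) arbiter for one packet slot."""

    def __init__(self, n):
        self.n = n
        self.top = n  # processor currently holding the highest priority (assumed start)

    def cycle(self, requesters):
        """Run one reservation cycle.

        requesters: set of processor numbers (1..n) that want the slot.
        Returns the winning processor, or None if nobody requests it.
        """
        for k in range(self.n):
            # Priority order: top, top-1, ..., 1, n, n-1, ... (cyclic).
            proc = (self.top - 1 - k) % self.n + 1
            if proc in requesters:
                # The winner drops to lowest priority; the processor
                # next in line after it becomes the highest.
                self.top = (proc - 2) % self.n + 1
                return proc
        return None
```

With n = 4 and processors 2 and 4 requesting the slot in every cycle, the winners alternate 4, 2, 4, 2, so neither processor is favored over time.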

Implementation issues of the three reservation
schemes are discussed in detail in [19], which also
includes simulation results regarding the fairness of the
three reservation schemes.

3.  System  bandwidth  and  packet  delays 

Let B_max be the maximum transmission rate at
which a processor can drive an optical bus. Then the
maximum bandwidth of a row bus is B_max and the
maximum bandwidth of the n × n ASOS is n × B_max. During
each packet slot, which is D = P + S units long, at most P
unit messages are transmitted. Therefore, the maximum
efficiency, θ, is

θ = P / (P + S)          (3.1a)

And the maximum bandwidth achievable, B_a, is

B_a = n × B_max × θ          (3.1b)

Assume that on average, each processor generates
L_r packets during each row phase and L_c packets during
each column phase, where 0 ≤ L_r, L_c ≤ 1. Denote the
average  number  of  packets  generated  per  packet  slot  by 

Lr.  Then  L,  =  .  The  average  throughput,  or 

effective bandwidth in a row phase, B_r, is

B_r = L_r × B_a          (3.2)

If we assume that the destinations of generated packets
are uniformly distributed among n column buses, the
maximum effective bandwidth in a column phase, B_c,
may be approximated by

B_c = L_c × B_a          (3.3)

Note that this maximum bandwidth in a column phase
may not be achievable due to packet slot contention.
Therefore, given a fixed average communication load,
mappings that distribute more load to row
communications and less load to column communications will
improve the effective system bandwidth.

By taking the average of equation (3.2) and
equation (3.3), the effective system bandwidth of an ASOS,
denoted by B_s, is

B_s = (B_r + B_c) / 2 = n (L_r + L_c) B_max P / (2 (P + S))          (3.4)
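
The bandwidth expressions of this section can be evaluated with the short sketch below. The function names are hypothetical, and the sample value B_max = 20 Gb/s assumes one bit per pulse at the 20 GHz rate quoted in the concluding remarks.

```python
def efficiency(P, S):
    """Maximum efficiency: P useful units out of every P + S units."""
    return P / (P + S)


def max_bandwidth(n, b_max, P, S):
    """Maximum achievable bandwidth B_a of an n x n array."""
    return n * b_max * efficiency(P, S)


def system_bandwidth(n, b_max, P, S, L_r, L_c):
    """Effective system bandwidth: the average of the row-phase and
    column-phase effective bandwidths."""
    B_a = max_bandwidth(n, b_max, P, S)
    B_r = L_r * B_a  # row phase
    B_c = L_c * B_a  # column phase (upper bound, ignores contention)
    return (B_r + B_c) / 2
```

With n = 8, B_max = 20 Gb/s, P = 16, S = 2 and L_r = L_c = 0.8, `system_bandwidth` returns approximately 113.8 Gb/s, the figure quoted in Section 5.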

Note that, from equations (3.1), the switching
speed S, relative to the packet length P, largely
determines the system efficiency and bandwidth.

Packet slot contention in column communications
not only can decrease the effective bandwidth but also
can introduce delays for packets. In a column phase, a
processor will not be able to send out a packet unless it
has succeeded in reserving the corresponding packet
slot. A packet that cannot be sent out due to an
unsuccessful reservation has to be delayed until a future
column phase. Define the packet delay to be the number of
column phases that a packet has been delayed due to
unsuccessful reservations of the corresponding packet
slot. Further, assume that during each column phase, the
number of packets that each processor generates is a
Poisson process with mean rate λ, where λ < 1. The
destinations of these packets are evenly distributed among n
columns. That is, on the average, out of λ packets
generated by a processor in each column phase, only

λ_i = λ / n

packets are destined for column bus i.

We first examine the average packet delays using
the round-robin reservation scheme. A natural analytic
model to use is the model with multiple queues and a
single server. Each processor is viewed as a station with
an ideally infinite queue. The server selects a station to
serve in round-robin fashion. The service time is a
constant unit time (which is one column phase). In [23],
a similar model is analyzed where the reply interval,
defined as the time for a server to switch from one
station to another, is assumed to be non-zero. For models
with zero reply interval such as ours, only approximate
analyses are given [12]. However, if we view the above
model with multiple queues as a single-queue M/D/1
model with the FCFS (First-Come First-Served) policy,
we will have the same result for average packet delays
for both models. In fact, with the same assumptions
about message arrival rate and service rate in both
models, the statistical characteristics of the unfinished
work in both models should be identical. In other words,
the total number of packets remaining, hence the queue
length in both models, is independent of the service
policy used. It follows from Little's law [14] that the
average packet delay is also independent
of the policy used. That is, the average packet
delay using the round-robin scheme should be the same as
that  of  a  Af  D  I  model.  Namely  1^.  Kd, 

average packet delay = (1/2) n λ_i / (1 - n λ_i) = (1/2) λ / (1 - λ)          (3.5)


It is also straightforward to obtain the same analysis
result for average packet delays using the linear priority
scheme. However, we are unable to analyze average
packet delays using the restrained linear priority scheme.
Figure 9 shows simulation results of average packet
delays vs. arrival rates λ for all three different
reservation schemes. We note that the results shown for both
round-robin and linear priority schemes conform to the
theoretical values as outlined in equation (3.5).
However, the restrained linear priority scheme has longer
average packet delays since some idle cycles are
artificially introduced in this scheme.

Figure 9. Average packet delays


If either the linear priority or the round-robin
scheme is used, average packet delays in column
communications are only about two column phases when the
load reaches 80% of the capacity (this is equivalent to
the case when λ = 0.8). Note however, that according to
the definition of the packet delays, packets in row
communication have zero packet delays.
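
The "about two column phases" figure can be reproduced from the standard mean waiting time of an M/D/1 queue with unit service time, λ / (2(1 - λ)). The sketch below is illustrative; the function name is hypothetical, and the formula is the textbook M/D/1 result rather than something taken from the simulation code behind Figure 9.

```python
def md1_delay(lam):
    """Mean waiting time, in column phases, of an M/D/1 queue with
    unit service time and total arrival rate lam (0 <= lam < 1)."""
    if not 0 <= lam < 1:
        raise ValueError("arrival rate must satisfy 0 <= lam < 1")
    return lam / (2 * (1 - lam))


def delay_table(rates):
    """Return [(lam, mean delay)] for a list of arrival rates."""
    return [(lam, md1_delay(lam)) for lam in rates]
```

At λ = 0.8 the function gives a delay of 2 column phases (up to floating-point rounding), and at λ = 0.5 it gives 0.5, which shows how sharply the delay grows as the load approaches capacity.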

4. Clock distribution and packet size limitation

One of the major concerns in designing a large
system is the synchronization problem. So far, we have
assumed that all processors and switches are connected to
a global clock with separate waveguides (called clock
waveguides) of equal length. Therefore, for
communication purposes all processors and switches virtually share
an identical global time. More specifically, let T_g be a
global time, and let PT_L(r, c) and ST_L(r, c) be the local
time of the processor and the switch, respectively, at row
r and column c. We have assumed that at any instant,
PT_L(r, c) = ST_L(r, c) = T_g + C for all r and c, where
C is a constant. We call such a clock distribution model
the identical-time model.


For example, consider a system with two nodes,
denoted by n_1 and n_2. Let T_L(1) and T_L(2) be their
local times respectively. To perform an operation
simultaneously at a given instant T_g, each node starts the
operation when its local time equals T_g. Here, an
operation could be the transmission of a packet if the nodes
are processors, or could be the change from one state to
the other if the nodes are switches. In the identical-time
model shown in Figure 10(a), two separate clock
waveguides of equal length are used to connect the two
nodes with the global clock. Since T_L(1) is always
identical to T_L(2), the two nodes will start at the same time to
perform a simultaneous operation on two packets. If
these two nodes are separated by D units on a
waveguide and a packet has a length of P units, the
condition P < D is necessary to prevent the two nodes from
performing the same operation on the same packet. For
instance, if these two nodes are processors, the above
condition prevents overlapping of transmitted packets.
If these two nodes are switches, the above condition
prevents a packet from being partially switched by two
switches.


(a) identical-time model    (b) skewed-time model

Figure 10. Two clock distribution models

Another model, which we call the skewed-time model,
is shown in Figure 10(b). Only one clock waveguide is
used to connect these two nodes to the global clock. It is
assumed that clock pulses and packets propagate in
opposite directions. Assume that the propagation delay
between n_1 and n_2 on the clock waveguide is d units;
we have T_L(1) - T_L(2) = d. If the two nodes are to
perform an operation simultaneously according to their
local times, then node n_1 will start d units earlier
than node n_2. Therefore, these two nodes will not
perform the operation on the same packet as long as
P < D + d. That is, given a fixed D, the packet size can
be increased by d units using the skewed-time model.
Note that, if clock pulses and packets propagate in the
same direction, the packet size will have to be
decreased. We are interested in increasing the packet
size for a given D to overcome current technology
restraints on the pulse width.
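
The packet-size condition of the two clock models can be sketched as below. The function names are hypothetical, and folding the switching time S into the skew requirement (D + d ≥ P + S) is an assumption taken from the overlap condition used in Section 5.

```python
def no_shared_packet(P, D, d=0):
    """True if two adjacent nodes, D units apart and skewed by d
    units, never act on the same packet (P < D + d).
    d = 0 corresponds to the identical-time model."""
    return P < D + d


def required_skew(P, D, S=0):
    """Smallest skew d (in time units) such that D + d >= P + S.
    Returns 0 when the identical-time model already suffices."""
    return max(0, P + S - D)
```

With P = 16, D = 7 and S = 2, `required_skew` returns 11, which matches the minimum skewing distance quoted in the concluding remarks.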


In order to have a regular connection between
switches and the global clock, the skewed-time model is
used to distribute the global clock to switches in ASOS,
as shown in Figure 11. Clock pulses always propagate in
a direction opposite to that of packet propagation. For
example, packets propagate on the lower segment of a
row bus from left to right. But at any row bus r, the
following equation holds:

ST_L(r, i) = ST_L(r, i-1) + d          (4.1)

that is, clock pulses propagate from right to left.
Suppose that switch i+1 has just been set to the cross state
when PS_{i+1} arrives. The beginning of PS_i is D + d units
away from that of PS_{i+1} and (D + d) - D = d units away
from switch i. After d units, switch i will be set to the
cross state and PS_i will arrive at switch i as desired. In
addition, switched packets propagate top down on the left
segment of a column bus but since
ST_L(r, c) = ST_L(r-1, c) + d, clock pulses propagate
bottom up at each column. Therefore, no packet
overlapping is possible on any column bus.


[Figure labels: To processors; To switches; Global Clock]

Figure 11. Control switches with skewed-time


We can similarly connect all processors to the
same global clock using the skewed-time model. Note
that processors transmit their packets on the upper
segment of row buses. These packets propagate from right
to left at each row and therefore clock pulses should
propagate from left to right at each row.

It is possible to recalculate equation (2.1) for the
skewed-time model in terms of the local time of each
processor so that communications and reservations can
be carried out as described in Section 2 and Section 3
respectively [19]. Note that this skewed-time model can
also be applied in other time division multiplexing or
pipelining systems.


5.  Concluding  remarks 

It has been established that much higher
bandwidth can be achieved with pipelined optical bus
interconnections than with electronic exclusive bus
access communications [7,16]. The high bandwidth of
optical  waveguides  can  also  be  achieved  using  two  novel 
time-division  approaches,  namely  TDSM  and  TDDM. 
These  two  approaches  are  used  in  the  design  presented 
in  this  paper. 

In addition to achieving high bandwidth, several
other design goals are met. One is the simplicity of the
hardware structures and controls. Only one switch is
used for each row and each column bus and switch
settings are performed uniformly and synchronously.
Another related goal is the feasibility and the flexibility
of the design given the available technologies. For
example, with current technology, it is possible to drive
an optical bus at 20 GHz [22]. This results in a pulse
width w, or a time unit as defined in this paper, of 50 ps
(picoseconds). Assuming that the speed of light in the
waveguide is c_g = 2×10^8 m/sec, the spatial length of a
pulse is 1 cm. With 100 ps switching speed [22] and a
16-bit message frame, the switching time S is 2 time
units and the packet length P is 16 time units in an 8×8
array. With backplane connections, it is reasonable to
assume that the spatial separation of two processors, D,
is 7 cm, or equivalently 7 time units. The condition to
prevent packet overlapping, namely D ≥ P + S, is thus
not satisfied. Therefore, the skewed-time model should
be used and the skewing distance d should be at least 11
time units. Note that if communication loads are 80% of
the communication capacity in both row and column
phases, that is, L_r = L_c = 0.8, the maximum effective
bandwidth for the 8×8 array is about 113.8 Gb/sec
according to equation (3.4).

When technology advances to the stage at which
optical buses are implemented on GaAs wafers with 100
GHz transmission and 10 ps switching speed, the
skewed-time model may no longer be necessary. In fact,
with D = 7 cm, a packet could contain up to 34 bits
without using the skewed-time model. Nevertheless, the
skewed-time model should be used if D is reduced to 1
cm (the order of the separation in chip-to-chip
connections).
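
The feasibility figures quoted in this section can be rechecked with the sketch below. It assumes one bit per time unit, the waveguide speed c_g = 2×10^8 m/sec given above, and the overlap condition D ≥ P + S; the function names are hypothetical.

```python
import math


def time_unit_ps(rate_ghz):
    """Pulse width (one time unit) in picoseconds at a given drive rate."""
    return 1000.0 / rate_ghz


def pulse_length_cm(rate_ghz, c_g=2e8):
    """Spatial length of one pulse in cm, for waveguide speed c_g in m/sec."""
    return c_g * time_unit_ps(rate_ghz) * 1e-12 * 100.0


def max_bits_without_skew(D_cm, rate_ghz, switch_ps, c_g=2e8):
    """Largest packet length P (in bits, one bit per time unit) that
    satisfies D >= P + S without the skewed-time model."""
    D_units = D_cm / pulse_length_cm(rate_ghz, c_g)
    S_units = switch_ps / time_unit_ps(rate_ghz)
    # Small epsilon guards against floating-point rounding just below an integer.
    return math.floor(D_units - S_units + 1e-9)
```

At 100 GHz with 10 ps switching and D = 7 cm the function returns 34 bits, and at 20 GHz with 100 ps switching and D = 7 cm it returns 5 bits, which is why a 16-bit packet needs the skewed-time model in the current-technology case.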

We note that our reservation schemes that relate to
TDDM are different from any of the schemes discussed
in [6]. Also, the use of random access schemes, such as
those in [15,17], for resolving time-slot contention
would yield longer average packet delays and lower
system bandwidth in this particular application, in which the
communication load is assumed to be high. Finally, we
note that some technology issues not mentioned in this
paper, such as pulse generation, coincidence detection
and power distribution, are discussed in [18,4].

Acknowledgement

This research is supported, in part, by a grant from

the  Air  Force  Office  of  Scientific  Research  under  grant 
#AFOSR-89-0469.

References 

1. R. Alferness, L. Buhl, S. Korotky, and R. Tucker,
“High-Speed Δβ-Reversal Directional Coupler
Switch,” Topical Meeting on Photonic Switching,
Technical Digest Series, vol. 13, pp. 77-78, 1987.

2. A. Benner, H. Jordan, and V. Heuring, “Optically
Switched Lithium Niobate Directional Couplers
for Digital Optical Computing,” SPIE Proc.,
Digital Optical Computing II, vol. 1215, pp. 343-352,
1990.

3. D. Chiarulli, R. Melhem, and S. Levitan, “Using
Coincident Optical Pulses for Parallel Memory
Addressing,” IEEE Computer, vol. 20, no. 12, pp.
48-58, 1987.

4. D. Chiarulli, R. Ditmore, R. Melhem, and S. Levitan,
“An All Optical Addressing Circuit: Experimental
Results and Scalability Analysis,” IEEE
Journal of Lightwave Technology, Special issue
on optical interconnection for information processing,
(to appear).

5. J. Cohen, “The Single Server Queues,” North
Holland, 1969.

6. M. Fine and F. Tobagi, “Demand Assignment
Multiple Access Schemes in Broadcast Bus Local
Area Networks,” IEEE Transactions on Computers,
vol. C-33, no. 12, pp. 1130-1159, Dec. 1984.

7. Z. Guo, R. Melhem, R. Hall, D. Chiarulli, and S.
Levitan, “Array Processors with Pipelined Optical
Busses,” Journal of Parallel and Distributed
Computing, vol. 12, no. 3, pp. 269-282, 1991.

8. P. Haugen, S. Rychnovsky, A. Husain, and L.
Hutcheson, “Optical Interconnects for high speed
computing,” Optical Engineering, vol. 25, pp.
1076-1085, Oct. 1989.

9. F. Kiamilev, S. Esener, V. Ozguz, and S. Lee,
“Programmable Optoelectronic Multiprocessor
Systems,” Digital Optical Computing, SPIE
Press, vol. CR35, pp. 197-220, July 1990.

10. L. Kleinrock, “Queueing Systems, Volume 1:
Theory,” John Wiley and Sons, 1975.

11. H. Kobayashi and A. Konheim, “Queueing
Models for Computer Communications Systems
Analysis,” IEEE Transactions on Communications,
vol. COM-25, no. 1, pp. 2-29, Jan. 1977.

12. A. Konheim, “Chaining in a Loop System,” IEEE
Transactions on Communications, vol. COM-24,
no. 2, pp. 203-210, Feb. 1976.

13. S. Levitan, D. Chiarulli, and R. Melhem, “Coincident
Pulse Techniques for Multiprocessor Interconnection
Structures,” Applied Optics, vol. 29,
no. 14, pp. 2024-2039, 1990.

14. J. Little, “A Proof for the Queueing Formula:
L = λW,” Operations Research, vol. 9, no. 4, pp.
204-209, July 1961.

15. N. Maxemchuk, “Twelve Random Access Strategies
for Fiber-Optic Networks,” IEEE Transactions
on Communications, vol. 36, pp. 942-950,
Aug. 1988.

16. R. Melhem, D. Chiarulli, and S. Levitan, “Space
Multiplexing of Waveguides in Optically Interconnected
Multiprocessor Systems,” The Computer
Journal, vol. 32, no. 4, pp. 362-369, 1989.

17. R. Metcalfe and D. Boggs, “Ethernet: Distributed
Packet Switching for Local Computer Networks,”
Communications of ACM, vol. 19, no. 7, pp. 395-403,
1976.

18. M. Nassehi, F. Tobagi, and M. Marhic, “Fiber
Optic Configurations for Local Area Networks,”
IEEE Journal on Selected Areas in Communication,
vol. SAC-3, no. 6, pp. 941-949, Nov. 1985.

19. C. Qiao and R. Melhem, “Time-Division Optical
Communications In Multiprocessor Arrays,”
Tech. Rep. 91-14, CS Department, University of
Pittsburgh, March 1991.

20. C. Qiao, D. Chiarulli, R. Melhem, and S. Levitan,
“Optical Multicasting in Linear Arrays,” International
Journal of Optical Computing, (to appear).

21. S. Ramanan and H. Jordan, “Serial Array
Shuffle-Exchange Architecture for Universal Permutation
of Time Slots,” SPIE Proc., Digital Optical
Computing II, vol. 1215, pp. 330-342, Jan.
1990.

22. J. Sauer, “A Multi-Gb/s Optical Interconnect,”
SPIE Proc., Digital Optical Computing II, vol.
1215, pp. 198-207, 1990.

23. H. Takagi, “Analysis of Polling Systems,” MIT
Press, 1986.

24. R. Thompson and P. Giordano, “An Experimental
Photonic Time-slot Interchanger Using Optical
Fibers as Reentrant Delay-line Memories,” IEEE
J. Lightwave Technology, vol. 1, pp. 154-162,
1987.


